What drives property prices - and can we prove it?
Real estate pricing is driven by multiple factors simultaneously. The challenge isn't just building a model - it's building one you can trust. A model that overfits to 414 data points is useless. So is one that can't tell you which features matter and by how much.
This project focused on statistical rigour: proper diagnostics, regularisation to handle multicollinearity, and resampling methods to quantify uncertainty around every coefficient estimate.
Dataset: 414 real estate transactions from Sindian District, Taipei (UCI repository). Features: house age, distance to nearest MRT station, number of nearby convenience stores, latitude, longitude, transaction date. Target: price per unit area (NT$/Ping).
From EDA to validated predictions
Exploratory Data Analysis
Distribution plots, correlation heatmap, scatter matrix. Identified that distance-to-MRT is severely right-skewed - the closest stations dominate. House age shows non-linear price effects.
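The skew check from the EDA step can be sketched as below. The data here is a synthetic stand-in (the real dataset isn't bundled with this write-up), and the column names are assumptions, not the UCI repository's exact headers:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 414-row UCI real-estate DataFrame.
# Column names are hypothetical; lognormal distances mimic the
# right-skew observed in the real MRT-distance feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dist_to_MRT": rng.lognormal(mean=6, sigma=1.2, size=414),
    "house_age": rng.uniform(0, 40, size=414),
})

# Quantify the skew that the distribution plots made visible
skew = df["dist_to_MRT"].skew()
print(f"dist_to_MRT skewness: {skew:.2f}")  # strongly positive => right-skewed
```

A skewness well above 1 is the numeric confirmation of what the histogram shows: a long right tail of far-from-MRT properties.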
Log Transform on MRT Distance
Applying log(distance_to_MRT) stabilised variance and made the relationship with price linear - a classic fix that materially improved regression diagnostics and model fit.
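A minimal sketch of the transform's effect, again on synthetic lognormal distances standing in for the real feature:

```python
import numpy as np
import pandas as pd

# Hypothetical MRT distances; lognormal noise mimics the real skew
rng = np.random.default_rng(1)
dist = rng.lognormal(mean=6, sigma=1.2, size=414)

raw_skew = pd.Series(dist).skew()
log_skew = pd.Series(np.log(dist)).skew()
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

The log transform collapses the long right tail, pulling skewness toward zero and giving the regression an approximately linear, homoscedastic predictor.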
OLS Regression + Diagnostics
Built multiple OLS models, tested assumptions via Breusch-Pagan (heteroscedasticity), Shapiro-Wilk (normality of residuals), and VIF scores (multicollinearity). Iterated until diagnostics passed.
LASSO Regularisation
Applied LASSO (L1) to simultaneously regularise and perform feature selection. Used LassoCV to find optimal α via cross-validation. Geographic coordinates were correctly shrunk to near-zero.
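A sketch of the LassoCV step on synthetic data, with a latitude column deliberately made collinear with MRT distance to mimic the situation described above. Variable names and the collinearity structure are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 414
log_dist = np.log(rng.lognormal(6, 1.2, n))
# Latitude constructed to be nearly collinear with MRT distance,
# so it carries almost no independent signal
lat = 24.96 + 0.002 * log_dist + rng.normal(0, 0.001, n)
age = rng.uniform(0, 40, n)
stores = rng.integers(0, 11, n).astype(float)
X = np.column_stack([log_dist, age, stores, lat])
y = 45 - 3 * log_dist - 0.27 * age + 1.1 * stores + rng.normal(0, 3, n)

Xs = StandardScaler().fit_transform(X)   # L1 penalty needs comparable scales
lasso = LassoCV(cv=10, random_state=0).fit(Xs, y)
print("alpha:", round(lasso.alpha_, 4))
print("coefs:", np.round(lasso.coef_, 2))  # latitude shrunk relative to log_dist
```

Standardising before fitting matters: the L1 penalty is applied uniformly across coefficients, so features on larger raw scales would otherwise be penalised inconsistently.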
10-Fold Cross-Validation
Evaluated generalisation with 10-fold CV - reporting CV MSE ~79 and R² ~0.58. This is the honest performance estimate, not the in-sample R² which always looks better.
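The CV evaluation can be sketched as follows; the data is again a synthetic stand-in, so the printed numbers will not match the reported MSE ~79 and R² ~0.58:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 414-row feature matrix
rng = np.random.default_rng(4)
n = 414
log_dist = np.log(rng.lognormal(6, 1.2, n))
age = rng.uniform(0, 40, n)
stores = rng.integers(0, 11, n).astype(float)
X = np.column_stack([log_dist, age, stores])
y = 45 - 3 * log_dist - 0.27 * age + 1.1 * stores + rng.normal(0, 3, n)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                       scoring="neg_mean_squared_error")
r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(f"CV MSE: {mse.mean():.1f}  CV R2: {r2.mean():.2f}")
```

Shuffling before splitting matters here: if the transactions are stored in date order, unshuffled folds would conflate fold membership with time.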
Bootstrap Confidence Intervals
1,000 bootstrap samples to estimate 95% confidence intervals for each coefficient. MRT distance coefficient: -0.31 [−0.38, −0.24] - statistically robust across all resamples.
```python
# Bootstrap confidence intervals for regression coefficients
import numpy as np
from sklearn.linear_model import LinearRegression

n_bootstrap = 1000
boot_coefs = []
for _ in range(n_bootstrap):
    # Resample rows with replacement, refit, and record the coefficients
    idx = np.random.choice(len(X), size=len(X), replace=True)
    X_b, y_b = X[idx], y[idx]
    model = LinearRegression().fit(X_b, y_b)
    boot_coefs.append(model.coef_)

boot_coefs = np.array(boot_coefs)
ci_lower = np.percentile(boot_coefs, 2.5, axis=0)
ci_upper = np.percentile(boot_coefs, 97.5, axis=0)
# Result: log(MRT_distance) CI = [-0.38, -0.24] - robust negative effect
```
What drives Taipei house prices
MRT distance (negative)
Coefficient: −0.31 (log scale). Every doubling of MRT distance reduces price per unit area by ~18%. The effect is non-linear - being near an MRT station matters far more than being slightly further away.
Nearby convenience stores
Each additional nearby store adds approximately 1.1 NT$/Ping. Proxy for urban density and walkability - areas with many stores command a consistent premium.
House age
Older buildings are discounted - approximately −0.27 NT$/Ping per year of age. Effect is more pronounced for properties over 20 years old.
CV R² ~0.58
Explains 58% of price variance on held-out data. Solid for a linear model on real estate data - the remaining variance reflects factors not in the dataset (interior quality, floor level, negotiation).
Technologies Used
| Tool | Purpose |
|---|---|
| Statsmodels | OLS regression, statistical inference, diagnostic tests (BP, SW, VIF) |
| scikit-learn | LASSO (LassoCV), cross-validation, Logistic Regression for classification |
| Pandas & NumPy | Data manipulation, bootstrap loop, feature engineering |
| Matplotlib & Seaborn | Distribution plots, residual plots, coefficient CI visualisations |
| Dataset | UCI Real Estate Valuation - 414 Taipei transactions |
Results
- MRT distance identified as the dominant pricing factor - consistent across OLS, LASSO, and classification models
- Log transformation of MRT distance was key - materially improved model diagnostics and fit
- Bootstrap CIs confirmed coefficient stability - findings are not artefacts of a particular data split
- LASSO correctly zeroed geographic coordinates - they're collinear with MRT distance at this scale
What I took away
- Cross-validation R² is the honest number - in-sample R² on 414 points can be inflated by a few outliers. Always report held-out performance.
- Diagnostic testing isn't optional - heteroscedasticity in the residuals invalidates standard errors and confidence intervals, making the model's uncertainty estimates unreliable.
- Bootstrap CIs are more interpretable than p-values for communicating "how sure are we about this coefficient" to a non-statistical audience.
- Log transformations on skewed predictors often matter more than model choice - get the functional form right first, then regularise.