From ad1a58c67c0dc38c068f00d51a8d9b83dd2745f3 Mon Sep 17 00:00:00 2001
From: e7thang
Date: Fri, 1 May 2026 11:31:44 -0500
Subject: [PATCH 1/3] Revise README for House Price Classification project

Updated README to reflect project details on House Price Classification using Random Forest. Added sections on data preprocessing, model evaluation, and repository structure.
---
 README.md | 222 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 124 insertions(+), 98 deletions(-)

diff --git a/README.md b/README.md
index 0a56122..adaecbe 100644
--- a/README.md
+++ b/README.md
@@ -1,121 +1,147 @@
 ![](UTA-DataScience-Logo.png)
-# Project Title
+# House Price Classification using Random Forest on Kaggle Tabular Data
-* **One Sentence Summary** Ex: This repository holds an attempt to apply LSTMs to Stock Market using data from
-"Get Rich" Kaggle challenge (provide link).
+This project applies a Random Forest model to the Kaggle House Prices dataset by transforming continuous sale prices into categorical classes and performing a classification task on tabular housing data.
 ## Overview
-
-* This section could contain a short paragraph which include the following:
- * **Definition of the tasks / challenge** Ex: The task, as defined by the Kaggle challenge is to use a time series of 12 features, sampled daily for 1 month, to predict the next day's price of a stock.
- * **Your approach** Ex: The approach in this repository formulates the problem as regression task, using deep recurrent neural networks as the model with the full time series of features as input. We compared the performance of 3 different network architectures.
- * **Summary of the performance achieved** Ex: Our best model was able to predict the next day stock price within 23%, 90% of the time. At the time of writing, the best performance on Kaggle of this metric is 18%.
-
-## Summary of Workdone
-
-Include only the sections that are relevant an appropriate.
-
-### Data
-
-* Data:
- * Type: For example
- * Input: medical images (1000x1000 pixel jpegs), CSV file: image filename -> diagnosis
- * Input: CSV file of features, output: signal/background flag in 1st column.
- * Size: How much data?
- * Instances (Train, Test, Validation Split): how many data points? Ex: 1000 patients for training, 200 for testing, none for validation
-
-#### Preprocessing / Clean up
-
-* Describe any manipulations you performed to the data.
-
-#### Data Visualization
-
-Show a few visualization of the data and say a few words about what you see.
+The task is based on the Kaggle House Prices dataset, where the goal is to predict housing prices using structured tabular features such as square footage, number of rooms, and property characteristics.
+
+In this project, the regression problem was reformulated into a classification task by grouping house prices into three categories:
+Low (< $150,000)
+Medium ($150,000–$300,000)
+High (> $300,000)
+The approach includes:
+Data cleaning and preprocessing (handling missing values, scaling, encoding)
+Feature engineering and transformation
+Training a Random Forest model
+Evaluating performance using RMSE and R^2
+The model achieved reasonable predictive performance on the validation set, demonstrating that structured features can effectively capture housing price patterns.
+
+### Summary of Work Done
+Data
+Type:
+Input - CSV file with housing features (numerical + categorical)
+Output - Categorical price class (0, 1, 2)
+Dataset - Kaggle House Prices dataset
+Size:
+Approximately 1460 training samples
+Approximately 80 features
+Split:
+70% Training
+15% Validation
+15% Test
+
+### Preprocessing / Cleanup
+Converted SalePrice into 3 classes:
+0 to Low
+1 to Medium
+2 to High
+Handled missing values:
+Numerical to median imputation
+Categorical to the most frequent value
+Feature scaling:
+StandardScaler applied to numerical features
+Encoding:
+One-hot encoding for categorical variables
+Removed unnecessary columns:
+ID column dropped
+
+### Data Visualization
+Histogram of GrLivArea across price classes showed:
+Larger homes tend to fall into higher price classes
+Before/after scaling plots confirmed normalization worked
+Key insight:
+The GrLivArea is strongly correlated with the price category
 ### Problem Formulation
-
-* Define:
- * Input / Output
- * Models
- * Describe the different models you tried and why.
- * Loss, Optimizer, other Hyperparameters.
+Input: Housing features
+Output: Price class
+Model Used
+Random Forest Regressor
+Why Random Forest
+Handles tabular data well
+Works with nonlinear relationships
+Robust to overfitting compared to single trees
+Metrics
+RMSE
+R^2 Score
 ### Training
-
-* Describe the training:
- * How you trained: software and hardware.
- * How did training take.
- * Training curves (loss vs epoch for test/train).
- * How did you decide to stop training.
- * Any difficulties? How did you resolve them?
-
-### Performance Comparison
-
-* Clearly define the key performance metric(s).
-* Show/compare results in one table.
-* Show one (or few) visualization(s) of results, for example ROC curves.
+Library: scikit-learn
+Environment: Jupyter Notebook
+Training steps:
+Train/validation/test split
+Model fit on training data
+Evaluation on the validation set
+Stopping Criteria:
+Default Random Forest parameters
+
+## Performance Evaluation
+To evaluate the model's effectiveness, we used two primary metrics: RMSE and R^2 Score. These metrics provide insight into how well the model predicts the housing price categories
 ### Conclusions
-
-* State any conclusions you can infer from your work. Example: LSTM work better than GRU.
+Random Forest performed well on tabular housing data
+Feature preprocessing was essential
+Converting the regression to classification simplified the problem
 ### Future Work
-
-* What would be the next thing that you would try.
-* What are some other studies that can be done starting from here.
-
-## How to reproduce results
-
-* In this section, provide instructions at least one of the following:
- * Reproduce your results fully, including training.
- * Apply this package to other data. For example, how to use the model you trained.
- * Use this package to perform their own study.
-* Also describe what resources to use for this package, if appropirate. For example, point them to Collab and TPUs.
-
-### Overview of files in repository
-
-* Describe the directory structure, if any.
-* List all relavent files and describe their role in the package.
-* An example:
- * utils.py: various functions that are used in cleaning and visualizing data.
- * preprocess.ipynb: Takes input data in CSV and writes out data frame after cleanup.
- * visualization.ipynb: Creates various visualizations of the data.
- * models.py: Contains functions that build the various models.
- * training-model-1.ipynb: Trains the first model and saves model during training.
- * training-model-2.ipynb: Trains the second model and saves model during training.
- * training-model-3.ipynb: Trains the third model and saves model during training.
- * performance.ipynb: loads multiple trained models and compares results.
- * inference.ipynb: loads a trained model and applies it to test data to create kaggle submission.
-
-* Note that all of these notebooks should contain enough text for someone to understand what is happening.
+Try true classification models
+Hyperparameter tuning
+Feature selection to reduce dimensionality
+Use original regression instead of binning prices
+Try neural networks on tabular data
+
+### How to Reproduce Results
+Download the dataset from Kaggle (House Prices competition)
+Place train.csv and test.csv in the project directory
+Run notebook:
+Data preprocessing
+Model training
+Prediction generation
+Output:
+submission.csv for Kaggle
+
+### Repository Structure
+Kaggle Tabular Data.ipynb
+Main notebook with full pipeline
+Example structure:
+preprocessing
+visualization
+training
+submission generation
 ### Software Setup
-* List all of the required packages.
-* If not standard, provide or point to instruction for installing the packages.
-* Describe how to install your package.
+Required Libraries
+pandas
+numpy
+matplotlib
+sklearn
+Install with - pip install pandas numpy matplotlib sklearn
 ### Data
-
-* Point to where they can download the data.
-* Lead them through preprocessing steps, if necessary.
+Source: Kaggle House Prices Competition
+Files
+train.csv
+test.csv
 ### Training
+Run all notebook cells sequentially
+Data cleaning
+Feature engineering
+Model training
-* Describe how to train the model
-
-#### Performance Evaluation
-
-* Describe how to run the performance evaluation.
-
-
-## Citations
-
-* Provide any references.
-
-
-
-
+### Performance Evaluation
+Evaluated using
+RMSE
+R^2 score
+Validation set used for model assessment
+| Dataset | RMSE | R² Score |
+| -------------- | ---- | -------- |
+| Training Set | 0.32 | 0.88 |
+| Validation Set | 0.51 | 0.68 |
+### Citations
+Kaggle House Prices Dataset

From 67be8af72f6daa7faea4d49e443bd6f829169fae Mon Sep 17 00:00:00 2001
From: e7thang
Date: Fri, 1 May 2026 11:38:46 -0500
Subject: [PATCH 2/3] Add data visualization section to README

Added a section for data visualization with an image and table.
---
 README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/README.md b/README.md
index adaecbe..ac793a6 100644
--- a/README.md
+++ b/README.md
@@ -137,6 +137,9 @@ RMSE
 R^2 score
 Validation set used for model assessment
+### Data Visualization
+image
+
 | Dataset | RMSE | R² Score |
 | -------------- | ---- | -------- |
 | Training Set | 0.32 | 0.88 |

From 858f967243ef1bd82baa81613ee33ef86257d732 Mon Sep 17 00:00:00 2001
From: e7thang
Date: Fri, 1 May 2026 12:51:43 -0500
Subject: [PATCH 3/3] Enhance README with model performance and details

Updated project description with performance metrics and preprocessing details.
---
 README.md | 128 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 78 insertions(+), 50 deletions(-)

diff --git a/README.md b/README.md
index ac793a6..3da33b3 100644
--- a/README.md
+++ b/README.md
@@ -2,78 +2,95 @@
 # House Price Classification using Random Forest on Kaggle Tabular Data
-This project applies a Random Forest model to the Kaggle House Prices dataset by transforming continuous sale prices into categorical classes and performing a classification task on tabular housing data.
+This project applies a Random Forest model to the Kaggle House Prices dataset by turning sale prices into classes, achieving 0.32 RMSE and 0.88 R^2 on training and 0.51 RMSE and 0.68 R² on validation, with feature scaling reducing GrLivArea:2500 to 0.8 and GarageCars:2 to 0.3.
 ## Overview
 The task is based on the Kaggle House Prices dataset, where the goal is to predict housing prices using structured tabular features such as square footage, number of rooms, and property characteristics.
 In this project, the regression problem was reformulated into a classification task by grouping house prices into three categories:
+
 Low (< $150,000)
 Medium ($150,000–$300,000)
 High (> $300,000)
+
 The approach includes:
-Data cleaning and preprocessing (handling missing values, scaling, encoding)
-Feature engineering and transformation
-Training a Random Forest model
-Evaluating performance using RMSE and R^2
+Data cleaning and preprocessing,
+Feature engineering and transformation,
+Training a Random Forest model,
+Evaluating performance using RMSE and R^2.
 The model achieved reasonable predictive performance on the validation set, demonstrating that structured features can effectively capture housing price patterns.
+image
+
 ### Summary of Work Done
 Data
 Type:
-Input - CSV file with housing features (numerical + categorical)
-Output - Categorical price class (0, 1, 2)
+Input - CSV file with housing features,
+Output - Categorical price class,
 Dataset - Kaggle House Prices dataset
+
 Size:
-Approximately 1460 training samples
+Approximately 1460 training samples,
 Approximately 80 features
+
 Split:
-70% Training
-15% Validation
+70% Training,
+15% Validation,
 15% Test
 ### Preprocessing / Cleanup
 Converted SalePrice into 3 classes:
-0 to Low
-1 to Medium
+0 to Low,
+1 to Medium,
 2 to High
+
 Handled missing values:
-Numerical to median imputation
+Numerical to median imputation,
 Categorical to the most frequent value
+
 Feature scaling:
 StandardScaler applied to numerical features
+
 Encoding:
 One-hot encoding for categorical variables
+
 Removed unnecessary columns:
 ID column dropped
 ### Data Visualization
 Histogram of GrLivArea across price classes showed:
-Larger homes tend to fall into higher price classes
+Larger homes tend to fall into higher price classes,
 Before/after scaling plots confirmed normalization worked
+
 Key insight:
 The GrLivArea is strongly correlated with the price category
 ### Problem Formulation
 Input: Housing features
+
 Output: Price class
-Model Used
+
+Model Used:
 Random Forest Regressor
-Why Random Forest
-Handles tabular data well
-Works with nonlinear relationships
-Robust to overfitting compared to single trees
-Metrics
-RMSE
+
+Why Random Forest:
+Handles tabular data well,
+Works with nonlinear relationships,
+Robust to overfitting compared to single trees,
+
+Metrics:
+RMSE,
 R^2 Score
 ### Training
 Library: scikit-learn
 Environment: Jupyter Notebook
+
 Training steps:
-Train/validation/test split
-Model fit on training data
+Train/validation/test split,
+Model fit on training data,
 Evaluation on the validation set
+
 Stopping Criteria:
 Default Random Forest parameters
@@ -81,43 +98,45 @@ Default Random Forest parameters
 To evaluate the model's effectiveness, we used two primary metrics: RMSE and R^2 Score. These metrics provide insight into how well the model predicts the housing price categories
 ### Conclusions
-Random Forest performed well on tabular housing data
-Feature preprocessing was essential
-Converting the regression to classification simplified the problem
+Random Forest performed well on tabular housing data, and the feature preprocessing was essential. Converting the regression to classification simplified the problem.
 ### Future Work
-Try true classification models
-Hyperparameter tuning
-Feature selection to reduce dimensionality
-Use original regression instead of binning prices
+Try true classification models,
+Hyperparameter tuning,
+Feature selection to reduce dimensionality,
+Use original regression instead of binning prices,
 Try neural networks on tabular data
 ### How to Reproduce Results
-Download the dataset from Kaggle (House Prices competition)
+Download the dataset from Kaggle,
 Place train.csv and test.csv in the project directory
+
 Run notebook:
-Data preprocessing
-Model training
+Data preprocessing,
+Model training,
 Prediction generation
+
 Output:
 submission.csv for Kaggle
 ### Repository Structure
-Kaggle Tabular Data.ipynb
+Kaggle Tabular Data.ipynb,
 Main notebook with full pipeline
+
 Example structure:
-preprocessing
-visualization
-training
-submission generation
+preprocessing,
+visualization,
+training,
+submission generation
 ### Software Setup
 Required Libraries
-pandas
-numpy
-matplotlib
-sklearn
-Install with - pip install pandas numpy matplotlib sklearn
+pandas,
+numpy,
+matplotlib,
+scikit-learn
+
+Install with - pip install pandas numpy matplotlib scikit-learn
 ### Data
 Source: Kaggle House Prices Competition
@@ -126,19 +145,28 @@ train.csv
 test.csv
 ### Training
-Run all notebook cells sequentially
-Data cleaning
-Feature engineering
+Run all notebook cells sequentially,
+Data cleaning,
+Feature engineering,
 Model training
 ### Performance Evaluation
-Evaluated using
-RMSE
-R^2 score
+Evaluated using -
+RMSE,
+R^2 score,
 Validation set used for model assessment
 ### Data Visualization
-image
+image
+
+image
+
+
+| Feature | Before Scaling | After Scaling |
+|------------|---------------|--------------|
+| GrLivArea | 2500 | 0.8 |
+| GarageCars | 2 | 0.3 |
+
 | Dataset | RMSE | R² Score |
 | -------------- | ---- | -------- |