Predicting a house price using ML .NET

ML .Net is an opensource cross-platform machine learning framework intended for .NET developers. Python(with routines are written in C++) is generally used to develop many ML libraries, e.g. TensorFlow, and this can add extra steps and hurdles when you need to tightly integrate ML components on the .Net platform. ML .Net provides a great set of tools to let you integrate machine learning applications using .NET – you can find out more about ML .NET here

ML .NET lets you develop a range of ML systems

Forecasting/Regression
Issue Classification
Predictive maintenance
Image classification
Sentiment Analysis
Recommender Systems
Clustering systems

ML .NET provides a developer-friendly API for machine learning, that supports the typical ML workflow:

Loading different types of data (Test, IEnumerable, Binary, Parquet, File Sets)
Transform Data Feature selection, normalization, changing the schema, category encoding, handling missing data
Choosing an appropriate algorithm for your problem (Linear, Boosted trees, SVM, K-Means).
Training (Training and evaluating models)
Evaluating the model
Deployment and Running(using the trained model)

A significant advantage of using the ML .NET framework is that it allows the user to quickly experiment with different learning algorithms, changing the set of features, sizes of training and test datasets to get the best results for their problem. Experimentation avoids a common issue where teams spend a lot of time collecting unnecessary data and produce models that do not perform well.

Learning Pipelines

ML .NET combines transforms for data preparation and model training into a single pipeline, these are then applied to training data and the input data used to make predictions in your model.

When discussing ML .NET, it is important to recognise to use of:

Transformers - these convert and manipulate data and produce data as an output.
Estimators - these take data and provide a transformer or a model, e.g. when training
Prediction - this uses a single row of features and predicts a single row of results.

key concept

We will see how these come into play in the simple regression sample.

House price sample

To understand how the functionality fits into the typical workflow of data preparation, training the model and evaluating the fit using test data sets and using the model. I took a look at implementing a simple regression application to predict the sale price of a house given a simple set of features over about 800 home sales. In the sample, we are going to take a look at a supervised learning problem of Multivariate linear regression. A supervised learning task is used to predice the value of a label from a set of features. In this case, we want to use one or more features to predict the sale price(the label) of a house.

The focus was on getting a small sample up and running, that can then be used to experiment with the choice of feature and training algorithms. You can find the code for this article on GitHub here

We will train the model using a set of sales data to predict the sale price of a house given a set of features over about 800 home sales. While the sample data has a wide range of features, a key aspect of developing a useful system would be to understand the choice of features used affects the model.

Before you start

You can find a 10 minute tutorial that will take you through installing and pre requisits - or just use the following steps.

You will need to have Visual Studio 16.6 or later with the .NET Core installed you can get there here

To start building .NET apps you just need to download and install the .NET SDK.

Install the ML.Net Package by running the following in a command prompt

dotnet add package Microsoft.ML --version 0.10.0

Data class

Our first job is to define a data class that we can use when loading our .csv file of house data. The important part to note is the [LoadColumn()] attributes; these allow us to associate the fields to different columns in the input. It gives us a simple way to adapt to changes in the data sets we can process. When a user wants to predict the price of a house they use the data class to give the features. Note, we do not need to use all the fields in the class when training the model.

CodeSnip 1

Training and Saving the model

CreateHousePriceModelUsingPipeline(…) method does most of the interesting work in creating the model used to predict house prices.

The code snippet shows how you can:

Read the data from a .csv file
Choose the training algorithm
Choose features to use when training
Handle string features
Create a pipeline to process the data and train the model

In these types of problem you will typically need to normalise the inbound data, consider the feature Rooms and BedRooms, the range of values of Rooms is usually larger then BedRooms, we normalise them to have the same influence on fit, and to speed up (depending on the trainer) the time to fitting. The trainer we using automatically normalises the features, though the framework provides tools to support normalizing if you need to do this yourself.

Similarly, depending on the number of features used (and if the model is overfitting) we apply Regularization to the trainer - this essentially keeps all the features but adds a weight to each of the features parameters to reduce the effect. In the sample, the trainer will handle regularisation; alternatively, you can make regularisation adjustments when creating the trainer.

CodeSnip 2

Evaluation

After training we need to evaluate our model using test data, this will indicate the size of the error between the predicted result and the actual results. Reducing the error will be part of an iterative process on a relatively small set of data to determine the best mix of features. There are different approaches supported by ML .NET We use cross-validation to estimate the variance of the model quality from one run to another, it and also eliminates the need to extract a separate test set for evaluation. We display the quality metrics to evaluate and get the model’s accuracy metrics

CodeSnip 3

We will look at two metrics:

L1 Loss - you need to minimise this, though if the input labels are not normalised, this can be quite high.
R-squared - provides a measure of the goodness of fit, for linear regressions models will be between 0 -> 1, the closer to 1 the better.

The absolute loss is defined as L1 = (1/m) * sum( abs( yi - y’i))

You can find a desciption of the available metrics here

Running the code as is yeilds an R-Sqaured of 0.608, we will want to improve this. A first step would be to start to experiment with the feature selection, perhaps to start with a reduced list of features after we have a clearer understanding then consider getting a wider set of data.

Saving the model for later use

It is simple to store the model used to predict house prices


// Save the model for later comsumption from end-user apps
using (var file = File.OpenWrite(outputModelPath))
    model.SaveTo(mlContext, file);

Load and predict house sale prices

Once you have tweaked the features and evaluate different training, you can then use the model to predict sales prices. I think this where the ML .NET framework shines because we can then use the cool tools in .Net to support different ways to use the model.

CodeSnip 4

For this type of ML application, a typical use would be to create a simple REST service running in a docker container deployed to Windows Azure. A web app written in javascript consumes the service to let people quickly see what a house should sell for.

Using .Net Core we can run the backend on different hardware platforms, Visual Studio 2019, 2017 makes the creation, deployment, and management of a robust service quick and easy.

Find the code in GitHub:

GitHub