Rajdeep Pathak
Data Science, Machine Learning and AI are the three buzzwords among today's geeks. More or less every one of us has some idea of Machine Learning, at least at a superficial level. Machine Learning is a method of data analysis where a model identifies patterns in the data it is trained on and then makes predictions on unseen data with minimal or zero human intervention.
So, how is Machine Learning automation better than automation via traditional programming?
Let me give you a real-world example. You always learn from your experiences. Say you don't yet know how it feels to touch a hot plate just taken out of the oven. You touch it, get a burn, and feel the pain. It hurts, and you decide never to touch a hot plate again. That was your experience of touching the hot plate, and you learned from it. From now on, you won't touch anything that comes straight out of the oven.
This is the case with Machine Learning. The model learns from previous experience and continuously upgrades itself to give the best possible predictions on unseen data, so the process of improvement is automated. Depending on the kind of output we want the model to predict, there are two broad types:
Regression: When the output variable is continuous (a numeric value, such as the price of a stock we're predicting), the model in action is a Regression model.
Classification: When the output variable is discrete (for example, whether a credit card transaction is fraudulent or not), the model in action is a Classification model.
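To make the distinction concrete, here's a minimal sketch in scikit-learn with a tiny made-up dataset (the numbers are purely illustrative, not from our project):

from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., a price) from one feature
X = [[1], [2], [3], [4]]             # toy feature, e.g., number of rooms
y_continuous = [100, 200, 300, 400]  # toy continuous target, e.g., price
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[5]]))            # ~500.0, a continuous prediction

# Classification: predict a discrete label (e.g., fraud / not fraud)
y_discrete = [0, 0, 1, 1]            # toy class labels
clf = LogisticRegression().fit(X, y_discrete)
print(clf.predict([[5]]))            # 0 or 1, a discrete prediction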
That should be enough of an introduction to Machine Learning; let's get our hands on building a model right away. We will build a Regression model to predict the price of houses in King County, USA.
You might be aware of Kaggle, a Data Science community with a huge library of datasets. This dataset is taken from Kaggle. You can explore Kaggle, grab a dataset from your favourite niche, and analyze it to build your own project.
If statistics and numbers fascinate you, you're good to read ahead even if you don't know how to code. If you're on good terms with Python, you're most welcome to read and understand the code to see what it's doing. And if you're not that familiar with Python, don't worry: just look at the outputs (the things that are happening) and you'll understand what's going on perfectly well. You'll get to see every step as our model learns by identifying patterns in the data we train it with, and then predicts the prices of houses from unseen data.
Wishing you a great journey ahead!
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
The following are the parameters of a house. Here, let me ask you a question: what do you think matters when deciding the price of a house? Yes, you guessed it right. Parameters like the number of bedrooms, the number of bathrooms, the square footage of the home, the presence of a backyard, and so on affect the price. You can also see that these parameters are positively correlated with the price: the more bedrooms, bathrooms or square footage a house has, the higher its price. In Machine Learning language, we call these parameters features.
We will train our model with a set of data containing the features (number of bedrooms, number of bathrooms, square footage, etc.) and the respective prices of the houses. The model will form a mathematical equation relating the price (the dependent variable) to the features (the independent variables), just like y = f(x), where y is the dependent variable and x is the independent one.
Following are the columns we'll have in our data. Note that not all columns help us estimate the price. For example, the house id and the date the house was sold don't play a significant role in determining the price.
id: A notation for the house
date: Date the house was sold
price: Price (the prediction target)
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
sqft_living: Square footage of the home
sqft_lot: Square footage of the lot
floors: Total floors (levels) in the house
waterfront: Whether the house has a view to a waterfront
view: Has been viewed
condition: How good the overall condition is
grade: Overall grade given to the housing unit, based on the King County grading system
sqft_above: Square footage of the house apart from the basement
sqft_basement: Square footage of the basement
yr_built: Year built
yr_renovated: Year the house was renovated
zipcode: Zip code
lat: Latitude coordinate
long: Longitude coordinate
sqft_living15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot-size area
sqft_lot15: Lot-size area in 2015 (implies some renovations)
Let us import the libraries we'll require. You can skip the code and focus on the outputs to understand what's going on.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline
We have the data in CSV (comma-separated values) format. We need to load it into a Pandas dataframe. A dataframe is just a table with rows and columns that we'll perform our operations on.
file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)  # read the CSV straight from the URL into a dataframe
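Before going further, a quick sanity check never hurts (not part of the original flow, just a handy habit): the shape attribute tells you how many rows and columns were loaded.

print(df.shape)    # (rows, columns) of the freshly loaded dataframe
print(df.columns)  # the column names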
Let us look at the first five rows of the dataset. We use the method head to display the first 5 rows of the dataframe.
df.head()
 | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.00 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.00 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.00 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.00 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
5 rows × 22 columns
We drop (remove) the columns id and Unnamed: 0 from the dataframe. Then, we use the describe() method to obtain a statistical summary of the data.
df.drop(['id', 'Unnamed: 0'], axis=1, inplace=True)
df.describe()
 | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
mean | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
std | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
min | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
25% | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
50% | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
75% | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
max | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
The above table shows a statistical summary of the data. The first row, count, tells us the number of values present in each column. For example, the price column reads 21613, meaning it holds the prices of 21613 houses. The bedrooms column reads 21600, which means we have bedroom data for only 21600 houses; notice that the bedroom counts of 13 houses are missing. Missing data is one of the most common problems a Data Scientist has to overcome. We will deal with it in a minute.
Similarly, the second row, mean, gives us the average of each column. On average, each house has 3 bedrooms and 2 bathrooms (as we see from the means of bedrooms and bathrooms).
std shows the standard deviation of the data; then we have min and max, showing the minimum and maximum values respectively.
The 25% row shows the first quartile, 50% the second quartile (the median), and 75% the third quartile of the data.
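If you'd like to reproduce those quartiles yourself (a small aside), Pandas exposes them through the quantile() method:

# First, second (median) and third quartiles of the price column
print(df['price'].quantile([0.25, 0.50, 0.75]))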
We can see that we have missing values in the columns bedrooms and bathrooms.
print("Number of missing values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("Number of missing values for the column bathrooms :", df['bathrooms'].isnull().sum())
Number of missing values for the column bedrooms : 13
Number of missing values for the column bathrooms : 10
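Rather than checking columns one by one, you can also count the missing values in every column with a single call (equivalent in spirit to the two prints above):

# NaN count per column; only bedrooms and bathrooms should be non-zero here
print(df.isnull().sum())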
One of the ways we can handle missing data is to replace it with the mean value of the respective column. For example, we can replace the missing values of the column 'bedrooms' with the mean of that column using the method replace().
mean = df['bedrooms'].mean()                        # average number of bedrooms
df['bedrooms'].replace(np.nan, mean, inplace=True)  # substitute it for every NaN
We replace the missing values of the column 'bathrooms' with the mean of that column in the same way.
mean = df['bathrooms'].mean()
df['bathrooms'].replace(np.nan, mean, inplace=True)
Now that this is done, let us confirm that there are no missing values anymore!
print("Number of missing values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("Number of missing values for the column bathrooms :", df['bathrooms'].isnull().sum())
Number of missing values for the column bedrooms : 0
Number of missing values for the column bathrooms : 0
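As an aside, scikit-learn offers the same mean-imputation strategy through its SimpleImputer class; a sketch equivalent in effect to the replace() calls above would look like this:

from sklearn.impute import SimpleImputer

# Fill any NaN in the two columns with that column's mean
imputer = SimpleImputer(strategy='mean')
df[['bedrooms', 'bathrooms']] = imputer.fit_transform(df[['bedrooms', 'bathrooms']])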
What we did just now was data wrangling, i.e., cleaning up the data: dropping unnecessary columns and dealing with missing values. It's now time to perform Exploratory Data Analysis and build our Machine Learning model.
Let us see the number of houses with unique floor values.
We use the method value_counts, and then the method to_frame() to convert the result to a dataframe.
x = df['floors'].value_counts().to_frame()
x
 | floors
---|---
1.0 | 10680 |
2.0 | 8241 |
1.5 | 1910 |
3.0 | 613 |
2.5 | 161 |
3.5 | 8 |
As you can see, there are 10680 houses with only 1 floor in our dataset, 8241 houses with 2 floors, and so on.
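If you prefer proportions over raw counts (just an optional variation), value_counts can normalize for you:

# Fraction of houses at each floor count, e.g. roughly 0.49 for single-floor houses
print(df['floors'].value_counts(normalize=True))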
Data Visualization is one of the essential steps in the data analysis workflow. A picture speaks a thousand words and always reinforces the concept; hence, building visualizations is a great habit during any data science task.
We use the function boxplot from the seaborn library to determine whether houses with or without a waterfront view have more price outliers.
plt.figure(figsize=(6,5))
sns.boxplot(x=df['waterfront'], y=df['price'])  # keyword args, as recent seaborn versions require
plt.title("Price vs Waterfront view")
plt.show()
What can you determine from the above graph? (The graph is a boxplot, so you may need to go through a basic tutorial or brainstorm a bit to read it; I'll leave that to you as an exercise.)
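If you'd like a numeric companion to that exercise, here's a sketch that counts outliers with the same 1.5 × IQR rule that boxplot whiskers conventionally use (my own addition, not part of the original notebook):

# Count price outliers (beyond 1.5*IQR) within each waterfront group
for wf, group in df.groupby('waterfront')['price']:
    q1, q3 = group.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = group[(group < q1 - 1.5 * iqr) | (group > q3 + 1.5 * iqr)]
    print(f"waterfront={wf}: {len(outliers)} outliers out of {len(group)} houses")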
Let's use the function regplot from the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.
plt.figure(figsize=(8,6))
sns.regplot(x=df['sqft_above'], y=df['price'])
plt.title("Correlation of Price vs Square Footage")
plt.show()
We can see a line with a positive slope. This means that sqft_above and price are positively correlated: the greater the square footage of the house apart from the basement, the higher its price.
We can use the Pandas method corr() to find the feature, other than price itself, that is most correlated with price.
df.corr(numeric_only=True)['price'].sort_values()  # numeric_only skips non-numeric columns such as date
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64
As you may know from the concept of correlation, the correlation coefficient always lies between -1 and 1. A negative coefficient signifies that as the value of the feature increases, the price decreases, and vice versa. A positive coefficient signifies that as the value of the feature increases, the price increases, and vice versa.
In the above output, we calculated the correlation coefficients of all features with the price.
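To see all pairwise correlations at once (an optional extra), seaborn's heatmap works nicely on the full correlation matrix:

# Heatmap of the correlation matrix; numeric_only again skips the text-valued date column
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')
plt.title("Feature correlation matrix")
plt.show()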
At this point, we have our dataset prepared and some basic analysis done. Buckle up, as we are now going to develop our machine learning model, which will predict the price of a new house given its features!
We will essentially split our dataset into two sets: a training set and a testing set. We will train the model using the training set and then predict the prices of the houses in the testing set. We will then measure how well the model performs using the R^2 score (the coefficient of determination). This score is at most 1 (and can even be negative for a very poor model); the closer it is to 1, the better our model fits.
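For the curious, R^2 compares the model's squared errors against a baseline that always predicts the mean; a hand-rolled version with made-up numbers (purely illustrative) looks like this:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up actual values
y_pred = np.array([2.8, 5.1, 7.3, 8.9])   # made-up predictions

ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
print(1 - ss_res / ss_tot)                      # close to 1 means a good fit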
We first fit a linear regression model to predict the 'price' using the following list of features:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
Then calculate the R^2.
lm = LinearRegression()
Z = df[features]
lm.fit(Z, df['price'])    # train the model on the features and prices
lm.score(Z, df['price'])  # R^2 on the same data it was trained on
0.657679183672129
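If you want to peek inside the fitted model (an optional inspection step), the learned equation's coefficients and intercept are exposed as attributes:

# One learned coefficient per feature: price ≈ intercept + sum(coef * feature)
for name, coef in zip(features, lm.coef_):
    print(f"{name}: {coef:.2f}")
print("intercept:", lm.intercept_)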
We are now at the final stage of our mini project, where we split our dataset into two parts, train the model on one and test it on the other. First, let's import the necessary modules.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
done
We will split the data into training and testing sets. We set test_size=0.15, which means that 15% of the data will be used for testing and the remaining 85% for training.
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
X = df[features]
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
print("Number of test samples:", x_test.shape[0])
print("Number of training samples:",x_train.shape[0])
Number of test samples: 3242
Number of training samples: 18371
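We imported cross_val_score earlier but haven't used it yet; although this walkthrough sticks to a single train/test split, here is a sketch of how you could use it on the same X and Y for a more robust estimate (averaging R^2 over, say, 4 folds):

# 4-fold cross-validation of a plain linear model; returns one R^2 per fold
scores = cross_val_score(LinearRegression(), X, Y, cv=4)
print(scores)
print("mean R^2:", scores.mean())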
Earlier, we used the Linear Regression model and got an R^2 score of 0.657679183672129.
Now, let's create and fit a first-order Ridge Regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. (Ridge regression is linear regression with an added penalty on the size of the coefficients, controlled by a regularization parameter alpha. You don't need to bother about the details; let's just see which model gives the better score, Linear Regression or Ridge Regression!)
from sklearn.linear_model import Ridge
RR = Ridge(alpha=0.1)
RR.fit(x_train, y_train) # this is where we are training the model
RR.score(x_test, y_test) # this is where we are passing the test set to calculate accuracy score
0.6478759163939122
We see that the first-order Ridge regression model gives a slightly lower score than the Linear Regression model. Let us now perform a second-order polynomial transform on both the training and testing data, create and fit a Ridge regression object using the transformed training data with the regularization parameter set to 0.1, and calculate the R^2 using the test data. A Pipeline chains these steps together neatly:
Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
         ('model', Ridge(alpha=0.1))]
pipe = Pipeline(Input)
pipe.fit(x_train, y_train)   # scaler, polynomial transform and Ridge are all fit on the training set only
pipe.score(x_test, y_test)   # R^2 on the held-out test set
So, which model gives the better score? Run it and compare for yourself: with the second-order terms in place, the polynomial Ridge pipeline can capture interactions between features that a straight line cannot, and on this data it should come out ahead of the plain Linear Regression model.
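Before we wrap up, one more knob worth knowing about: the regularization strength alpha. Our choice of 0.1 was arbitrary, so here's a quick sketch (the candidate values are picked arbitrarily too) that compares a few settings for the plain first-order Ridge model:

# Try a few regularization strengths and report the test-set R^2 for each
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha)
    model.fit(x_train, y_train)
    print(f"alpha={alpha}: R^2 = {model.score(x_test, y_test):.4f}")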
We just built a machine learning model, trained it on some data, and measured how well it performed at predicting unseen data (the test set). For perspective, the world's largest Deep Learning model as of this writing, Microsoft's Turing Natural Language Generation (T-NLG), has 17 billion parameters, whereas our mini project worked with just 11 features!
Machine Learning and AI are automating things at a great pace and are expected to take over multiple jobs in the near future!