Building a Machine Learning Model

Rajdeep Pathak


UG-I Mathematics

Data Science, Machine Learning and AI are three of the biggest buzzwords among today's geeks. Most of us have at least a superficial idea of what Machine Learning is. Machine Learning is a method of data analysis in which a model identifies patterns in the data it is trained on and then makes predictions on unseen data with minimal or no human intervention.

So, how is Machine Learning automation better than automation via traditional programming?

Let me give you a real-world example. You always learn from your experiences. Suppose you have never felt what it is like to touch a hot plate just taken out of the oven. You touch it, get a burn, and feel the pain. It hurts, and you decide never to touch a hot plate again. That was your experience of touching the hot plate, and you learned from it. From now on, you won't touch anything that has just come straight out of the oven.

This is how Machine Learning works. It learns from previous experience and continuously improves itself to give the best possible predictions on unseen data. This process of improvement is automated.


Machine Learning can be broadly divided into **Supervised Learning** and **Unsupervised Learning**. I'm not going into a deep explanation of either of them. Two of the **Supervised Learning** approaches are **Regression** and **Classification**.
  • Regression: When the output variable is continuous (a numerical value, such as the price of a stock we are predicting), the model we put into action is a Regression model.

  • Classification: When the output variable is discrete (for example, whether a credit card transaction is fraudulent or not), the model we put into action is a Classification model.


That should be enough of an introduction to Machine Learning, so let's get our hands on building a model in no time. We will build a Regression model to predict the price of houses in King County, USA.

You might be aware of Kaggle, a Data Science community with a huge library of datasets. This dataset is taken from Kaggle. You can explore Kaggle, pick a dataset from your favourite niche, and analyse it to build your own project.

If statistics and numbers fascinate you, you can read ahead even if you don't know the coding part. If you are on good terms with Python, you are most welcome to read the code and understand what it is doing. If you are not that familiar with Python, don't worry: just look at the outputs (the things that are happening) and you will understand what is going on. You will see every step as our model learns by identifying patterns in the data we train it with, and then predicts the prices of houses it has never seen.

Wishing you a great journey ahead!

House Sales in King County, USA

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

The following are the parameters of a house. Here, let me ask you a question. What do you think matters when deciding the price of a house? Yes, you guessed it right. Parameters like the number of bedrooms, the number of bathrooms, the square footage of the home, the presence of a backyard, and so on affect the price. You can also see that these parameters are positively correlated with the price: the more bedrooms, bathrooms or square footage a house has, the higher its price. In Machine Learning language, we call these parameters features.

We will train our model with a set of data containing the features (number of bedrooms, number of bathrooms, square footage, etc.) and the corresponding prices of the houses. The model will form a mathematical relation between the price (the dependent variable) and the features (the independent variables), just like y = f(x), where y is the dependent variable and x is the independent variable.
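
To make this concrete, for a linear model the relation might look something like price ≈ b0 + b1 × bedrooms + b2 × bathrooms + b3 × sqft_living + ..., and "training" simply means finding the values of b0, b1, b2, ... that fit the training data best. (The features named here are just for illustration; we will choose the actual feature list later.)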

Following are the columns we will have in our data. You must note that not all columns will help us estimate the price; for example, the house id and the date the house was sold do not play a significant role in determining the price.

id: A notation for the house

date: Date the house was sold

price: Price (the prediction target)

bedrooms: Number of bedrooms

bathrooms: Number of bathrooms

sqft_living: Square footage of the home

sqft_lot: Square footage of the lot

floors: Total floors (levels) in the house

waterfront: Whether the house has a view of a waterfront

view: Has been viewed

condition: How good the overall condition is

grade: Overall grade given to the housing unit, based on the King County grading system

sqft_above: Square footage of the house apart from the basement

sqft_basement: Square footage of the basement

yr_built: Year the house was built

yr_renovated: Year the house was renovated

zipcode: Zip code

lat: Latitude coordinate

long: Longitude coordinate

sqft_living15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot-size area

sqft_lot15: Lot-size area in 2015 (implies some renovations)

Let us import the libraries that we would require. You can skip the code and focus on the outputs to understand what's going on.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

Importing Data Sets

We have the data in CSV (comma-separated values) format. We need to load it into a Pandas dataframe. A dataframe is just a table with rows and columns on which we will perform our operations.

In [2]:
file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df=pd.read_csv(file_name)

Let us look at the first five rows of the dataset. We use the method head to display the first 5 rows of the dataframe.

In [3]:
df.head()
Out[3]:
Unnamed: 0 id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 0 7129300520 20141013T000000 221900.0 3.0 1.00 1180 5650 1.0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 1 6414100192 20141209T000000 538000.0 3.0 2.25 2570 7242 2.0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 2 5631500400 20150225T000000 180000.0 2.0 1.00 770 10000 1.0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 3 2487200875 20141209T000000 604000.0 4.0 3.00 1960 5000 1.0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 4 1954400510 20150218T000000 510000.0 3.0 2.00 1680 8080 1.0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 22 columns

Data Wrangling

We drop (remove) the columns "id" and "Unnamed: 0" from the dataframe. Then, we use the describe() method to obtain a statistical summary of the data.

In [4]:
df.drop(['id', 'Unnamed: 0'], axis=1, inplace=True)
df.describe()
Out[4]:
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.161300e+04 21600.000000 21603.000000 21613.000000 2.161300e+04 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000 21613.000000
mean 5.400881e+05 3.372870 2.115736 2079.899736 1.510697e+04 1.494309 0.007542 0.234303 3.409430 7.656873 1788.390691 291.509045 1971.005136 84.402258 98077.939805 47.560053 -122.213896 1986.552492 12768.455652
std 3.671272e+05 0.926657 0.768996 918.440897 4.142051e+04 0.539989 0.086517 0.766318 0.650743 1.175459 828.090978 442.575043 29.373411 401.679240 53.505026 0.138564 0.140828 685.391304 27304.179631
min 7.500000e+04 1.000000 0.500000 290.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 1.000000 290.000000 0.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 3.219500e+05 3.000000 1.750000 1427.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 0.000000 1951.000000 0.000000 98033.000000 47.471000 -122.328000 1490.000000 5100.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 0.000000 1975.000000 0.000000 98065.000000 47.571800 -122.230000 1840.000000 7620.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068800e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 560.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

The above table shows a statistical summary of the data. The first row, count, tells us the number of values present in each column. For example, the price column reads 21613, which means it contains the prices of 21613 houses. The bedrooms column reads 21600, which means it has bedroom data for only 21600 houses: data about the bedrooms of 13 houses is missing. Missing data is one of the most common problems a Data Scientist has to overcome. We will deal with it in a minute.

Similarly, the second row, mean, gives the average of each column. On average, each house has about 3 bedrooms and 2 bathrooms (as we see from the means of bedrooms and bathrooms).

std shows the standard deviation of the data, while min and max show the minimum and maximum values respectively.

The 25% row shows the first quartile, 50% the second quartile (i.e., the median), and 75% the third quartile of the data.
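
If you want to pull these quartiles out for a single column yourself, a one-liner like the following should do it (a minimal sketch, assuming df has been loaded as above):

df['price'].quantile([0.25, 0.50, 0.75])   # first quartile, median and third quartile of price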

We can see that we have missing values in the columns bedrooms and bathrooms.

In [5]:
print("Number of missing values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("Number of missing values for the column bathrooms :", df['bathrooms'].isnull().sum())
Number of missing values for the column bedrooms : 13
Number of missing values for the column bathrooms : 10

One of the ways to handle missing data is to replace it with the mean value of the respective column. For example, we can replace the missing values of the column 'bedrooms' with the mean of that column using the method replace().

In [6]:
mean=df['bedrooms'].mean()
df['bedrooms'].replace(np.nan, mean, inplace=True)

We also replace the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace().

In [7]:
mean=df['bathrooms'].mean()
df['bathrooms'].replace(np.nan,mean, inplace=True)

Now that this is done, let us confirm that there are no missing values left!

In [8]:
print("Number of missing values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("Number of missing values for the column bathrooms :", df['bathrooms'].isnull().sum())
Number of missing values for the column bedrooms : 0
Number of missing values for the column bathrooms : 0
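
If you would rather check every column in one go instead of column by column, the following one-liner (a small sketch) counts the missing values across the whole dataframe; bedrooms and bathrooms should now read 0:

df.isnull().sum()   # number of missing values in each column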

Exploratory Data Analysis

What we did just now was data wrangling, i.e., cleaning up the data: dropping unnecessary columns and dealing with missing values. It's now time to perform exploratory data analysis and build our Machine Learning model.

Let us see the number of houses for each unique floor value. We use the method value_counts, and then the method .to_frame() to convert the result into a dataframe.

In [9]:
x = df['floors'].value_counts().to_frame()
x
Out[9]:
floors
1.0 10680
2.0 8241
1.5 1910
3.0 613
2.5 161
3.5 8

As you can see, there are 10680 houses with only 1 floor in our dataset, 8241 houses with 2 floors, and so on.

Data visualization is an essential part of the data analysis workflow. "A picture speaks a thousand words and always reinforces the concept" - hence, building visualizations is a great habit during any data science task. Let us use the boxplot function from the seaborn library to determine whether houses with or without a waterfront view have more price outliers.

In [10]:
plt.figure(figsize=(6,5))
sns.boxplot(x=df['waterfront'], y=df['price'])  # price distribution for houses without (0) and with (1) a waterfront view
plt.title("Price vs Waterfront view")
plt.show()

What can you determine from the above graph? (The graph is a boxplot, so you may need to go through a basic tutorial or brainstorm a bit to interpret it. I'll leave that to you as an exercise, though the small sketch below can help you check your answer.)
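
One way to check your reading of the boxplot with actual numbers is to count the price outliers (points beyond 1.5 times the interquartile range) separately for each waterfront group. A rough sketch, assuming the same df as above:

# count price outliers (beyond 1.5*IQR) for each waterfront group
for wf, prices in df.groupby('waterfront')['price']:
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
    print("waterfront =", wf, "-> outliers:", len(outliers))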

Let's use the regplot function from the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.

In [13]:
plt.figure(figsize=(8,6))
sns.regplot(x=df['sqft_above'], y=df['price'])  # scatter of sqft_above vs price with a fitted regression line
plt.title("Correlation of Price vs Square Footage")
plt.show()

We can see a fitted line with a positive slope. This means that sqft_above and price are positively correlated: the greater the value of sqft_above (the square footage of the house apart from the basement), the higher its price.
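
The regplot only shows the trend visually; the actual coefficient behind it can be computed directly (again assuming the same df as above):

df['sqft_above'].corr(df['price'])   # Pearson correlation between sqft_above and price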

We can use the Pandas method corr() to find the feature other than price that is most correlated with price.

In [14]:
df.corr()['price'].sort_values()
Out[14]:
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64

As you know from the concept of correlation, the correlation coefficient can only lie between -1 and 1. A negative correlation coefficient signifies that the price decreases as the value of the feature increases, while a positive correlation coefficient signifies that the price increases as the value of the feature increases.

In the above output, we calculated the correlation coefficients of all features with the price.
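
If you would like the code to name the winner for you, a small sketch like the one below picks out the feature (other than price) most correlated with price; from the values above it should come out as sqft_living:

df.corr()['price'].drop('price').idxmax()   # feature with the highest correlation with price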

Model Development

At this point, we have our dataset prepared and some basic analysis done. Buckle up, as we are now going to develop the machine learning model that will predict the prices of new houses given their features!

We will essentially split our dataset into two sets: a training set and a testing set. We will train the model on the training set and then predict the prices of the houses in the testing set. We will then measure the model's accuracy using the R^2 score, which is typically a value between 0 and 1: the closer it is to 1, the more accurate our model.
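
For reference, the R^2 that scikit-learn's score() method reports is computed as R^2 = 1 - (sum of squared prediction errors) / (sum of squared deviations of the prices from their mean), so it measures how much of the variation in price the model explains. The same number can also be computed from predictions with sklearn.metrics.r2_score(y_true, y_pred).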

We first fit a linear regression model to predict the 'price' using the list of features:

In [15]:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]     

Then calculate the R^2.

In [17]:
lm = LinearRegression()
Z = df[features]
lm.fit(Z, df['price'])
lm.score(Z, df['price'])
Out[17]:
0.657679183672129
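
If you are curious how much weight the fitted model gives each feature, the learned coefficients can be inspected directly. A quick sketch using the lm and features objects from the cells above:

for name, coef in zip(features, lm.coef_):
    print(name, ":", round(coef, 2))          # learned weight for each feature
print("intercept :", round(lm.intercept_, 2))  # value predicted when all features are zero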

Model Evaluation and Refinement

We are now at the final stage of our mini project, where we split our dataset into two parts, train the model on one part and test it on the other. First, let's import the necessary modules.

In [18]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
done

We will split the data into training and testing sets. We set test_size=0.15, which means that 15% of the data will be used for testing and the remaining 85% for training.

In [19]:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]    
X = df[features]
Y = df['price']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)


print("Number of test samples:", x_test.shape[0])
print("Number of training samples:",x_train.shape[0])
Number of test samples: 3242
Number of training samples: 18371

Earlier, we used the Linear Regression model and got an accuracy score of 0.657679183672129 (computed on the same data the model was trained on).

Now, let's create and fit a first-order Ridge Regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. (Ridge regression is another regression algorithm in which the coefficients are shrunk, i.e. penalised, by a regularization parameter alpha. You don't need to worry about the details; let's just see which model gives a better accuracy score - Linear Regression or Ridge Regression!)
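
For the mathematically curious: ordinary least squares chooses the coefficients that minimise the sum of squared errors, while Ridge regression minimises (sum of squared errors) + alpha × (sum of squared coefficients). The extra penalty term discourages large coefficients, which often helps the model generalise better to unseen data.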

In [20]:
from sklearn.linear_model import Ridge
In [21]:
RR = Ridge(alpha=0.1)
RR.fit(x_train, y_train)  # this is where we are training the model
RR.score(x_test, y_test)  # this is where we are passing the test set to calculate accuracy score
Out[21]:
0.6478759163939122

We see that the first-order Ridge regression model gives a slightly lower accuracy score than the Linear Regression model. Let us now perform a second-order polynomial transform on both the training and testing data, create and fit a Ridge regression object using the transformed training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.

In [22]:
Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(degree=2, include_bias=False)),('model',LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(x_train, y_train)
pipe.fit(x_test, y_test)
RR = Ridge(alpha=0.1)
RR.fit(x_train, y_train) # training the model
RR.score(x_test, y_test) # calculating accuracy score
Out[22]:
0.6478759163939122
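
A small note on the cell above: the pipeline is fitted but never actually used for scoring, so the Ridge model there is trained and evaluated on the untransformed features, which is why the score is identical to the one in In [21]. A sketch of what the intended transform-then-fit sequence might look like is below (the resulting R^2 will differ from the value shown above, so re-run it to see which model really wins):

# scale and expand the features to second-order polynomial terms, then fit Ridge on the transformed data
pr = Pipeline([('scale', StandardScaler()),
               ('polynomial', PolynomialFeatures(degree=2, include_bias=False))])
x_train_pr = pr.fit_transform(x_train)   # fit the scaler/transform on the training data only
x_test_pr = pr.transform(x_test)         # apply the same transform to the test data
RR_pr = Ridge(alpha=0.1)
RR_pr.fit(x_train_pr, y_train)           # train on the transformed training data
RR_pr.score(x_test_pr, y_test)           # R^2 on the transformed test data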

So, which model gives better accuracy? On these numbers, the plain Linear Regression model, though keep in mind that its score was computed on the same data it was trained on, while the Ridge scores come from the held-out test set.

Conclusion

We just built a machine learning model and trained it with some data. We measured its accuracy on the basis of how well it predicted the prices of unseen data (the test set). The world's largest deep learning model (as of this writing) is Microsoft's Turing Natural Language Generation (T-NLG), which has 17 billion parameters! In this mini project, by contrast, our model used just 11 features.

Machine Learning and AI are automating a great many things and are expected to take over multiple jobs in the near future!