Rajdeep Pathak
Data Science, Machine Learning and AI are the three buzzwords among today's geeks. More or less every one of us has some idea of Machine Learning, at least at a superficial level. Machine Learning is a method of data analysis where a model identifies patterns in the data it is trained on and then makes predictions on unseen data with minimal or zero human intervention.
So, how is Machine Learning automation better than automation via traditional programming?
Let me give you a real-world example. You always learn from your experiences. Say you don't yet know how it feels to touch a hot plate just taken out of the oven. You touch it, get a burn, and feel the pain. It hurts, and you decide never to touch a hot plate again. That was your experience of touching the hot plate, and you learned from it. From now on, you won't touch anything that comes straight out of the oven.
This is the case with Machine Learning. The model learns from previous experience and continuously upgrades itself to give the best possible predictions on unseen data, so the process of improvement is automated. Depending on the kind of output we want the model to predict, there are two broad types:
Regression: When the output variable is continuous (a numeric value, such as the price of a stock we're predicting), the model in action is a Regression model.
Classification: When the output variable is discrete (for example, whether a credit card transaction is fraudulent or not), the model in action is a Classification model.
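To make the distinction concrete, here's a minimal sketch in scikit-learn with a tiny made-up dataset (the numbers are purely illustrative, not from our project):

from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., a price) from one feature
X = [[1], [2], [3], [4]]             # toy feature, e.g., number of rooms
y_continuous = [100, 200, 300, 400]  # toy continuous target, e.g., price
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[5]]))            # ~500.0, a continuous prediction

# Classification: predict a discrete label (e.g., fraud / not fraud)
y_discrete = [0, 0, 1, 1]            # toy class labels
clf = LogisticRegression().fit(X, y_discrete)
print(clf.predict([[5]]))            # 0 or 1, a discrete prediction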
That should be enough of an introduction to Machine Learning; let's get our hands on building a model right away. We will build a Regression model to predict the price of houses in King County, USA.
You might be aware of Kaggle, a Data Science community with a huge library of datasets. This dataset is taken from Kaggle. You can explore Kaggle, grab a dataset from your favourite niche, and analyze it to build your own project.
If statistics and numbers fascinate you, you're good to read ahead even if you don't know how to code. If you're on good terms with Python, you're most welcome to read and understand the code to see what it's doing. And if you're not that familiar with Python, don't worry: just look at the outputs (the things that are happening) and you'll understand what's going on perfectly well. You'll get to see every step as our model learns by identifying patterns in the data we train it with, and then predicts the prices of houses from unseen data.
Wishing you a great journey ahead!
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
The following are the parameters of a house. Here, let me ask you a question: what do you think matters when deciding the price of a house? Yes, you guessed it right. Parameters like the number of bedrooms, the number of bathrooms, the square footage of the home, the presence of a backyard, and so on affect the price. You can also see that these parameters are positively correlated with the price: the more bedrooms, bathrooms or square footage a house has, the higher its price. In Machine Learning language, we call these parameters features.
We will train our model with a set of data containing the features (number of bedrooms, number of bathrooms, square footage, etc.) and the respective prices of the houses. The model will form a mathematical equation relating the price (the dependent variable) to the features (the independent variables), just like y = f(x), where y is the dependent variable and x is the independent one.
Following are the columns we'll have in our data. Note that not all columns help us estimate the price. For example, the house id and the date the house was sold don't play a significant role in determining the price.
id: A notation for the house
date: Date the house was sold
price: Price (the prediction target)
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
sqft_living: Square footage of the home
sqft_lot: Square footage of the lot
floors: Total floors (levels) in the house
waterfront: Whether the house has a view to a waterfront
view: Has been viewed
condition: How good the overall condition is
grade: Overall grade given to the housing unit, based on the King County grading system
sqft_above: Square footage of the house apart from the basement
sqft_basement: Square footage of the basement
yr_built: Year built
yr_renovated: Year the house was renovated
zipcode: Zip code
lat: Latitude coordinate
long: Longitude coordinate
sqft_living15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot-size area
sqft_lot15: Lot-size area in 2015 (implies some renovations)
Let us import the libraries we'll require. You can skip the code and focus on the outputs to understand what's going on.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline
We have the data in CSV (comma-separated values) format. We need to load it into a Pandas dataframe. A dataframe is just a table with rows and columns that we'll perform our operations on.
file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)  # read the CSV straight from the URL into a dataframe
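Before going further, a quick sanity check never hurts (not part of the original flow, just a handy habit): the shape attribute tells you how many rows and columns were loaded.

print(df.shape)    # (rows, columns) of the freshly loaded dataframe
print(df.columns)  # the column names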
Let us look at the first five rows of the dataset. We use the method head to display the first 5 rows of the dataframe.
df.head()
 | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.00 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.00 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.00 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.00 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
5 rows × 22 columns
We drop (remove) the columns id and Unnamed: 0 from the dataframe. Then, we use the describe() method to obtain a statistical summary of the data.
df.drop(['id', 'Unnamed: 0'], axis=1, inplace=True)
df.describe()
 | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
mean | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
std | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
min | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
25% | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
50% | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
75% | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
max | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
The above table shows a statistical summary of the data. The first row, count, tells us the number of values present in each column. For example, the price column reads 21613, meaning it holds the prices of 21613 houses. The bedrooms column reads 21600, which means we have bedroom data for only 21600 houses; notice that the bedroom counts of 13 houses are missing. Missing data is one of the most common problems a Data Scientist has to overcome. We will deal with it in a minute.
Similarly, the second row, mean, gives us the average of each column. On average, each house has 3 bedrooms and 2 bathrooms (as we see from the means of bedrooms and bathrooms).
std shows the standard deviation of the data; then we have min and max, showing the minimum and maximum values respectively.
The 25% row shows the first quartile, 50% the second quartile (the median), and 75% the third quartile of the data.
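If you'd like to reproduce those quartiles yourself (a small aside), Pandas exposes them through the quantile() method:

# First, second (median) and third quartiles of the price column
print(df['price'].quantile([0.25, 0.50, 0.75]))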
We can see that we have missing values in the columns bedrooms and bathrooms.
print("Number of missing values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("Number of missing values for the column bathrooms :", df['bathrooms'].isnull().sum())
Number of missing values for the column bedrooms : 13
Number of missing values for the column bathrooms : 10
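Rather than checking columns one by one, you can also count the missing values in every column with a single call (equivalent in spirit to the two prints above):

# NaN count per column; only bedrooms and bathrooms should be non-zero here
print(df.isnull().sum())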
One of the ways we can handle missing data is to replace it with the mean value of the respective column. For example, we can replace the missing values of the column 'bedrooms' with the mean of that column using the method replace().
mean = df['bedrooms'].mean()                        # average number of bedrooms
df['bedrooms'].replace(np.nan, mean, inplace=True)  # substitute it for every NaN
We replace the missing values of the column 'bathrooms' with the mean of that column in the same way.
mean = df['bathrooms'].mean()
df['bathrooms'].replace(np.nan, mean, inplace=True)
Now that this is done, let us confirm that there are no missing values anymore!
print("Number of missing values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("Number of missing values for the column bathrooms :", df['bathrooms'].isnull().sum())
Number of missing values for the column bedrooms : 0
Number of missing values for the column bathrooms : 0
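As an aside, scikit-learn offers the same mean-imputation strategy through its SimpleImputer class; a sketch equivalent in effect to the replace() calls above would look like this:

from sklearn.impute import SimpleImputer

# Fill any NaN in the two columns with that column's mean
imputer = SimpleImputer(strategy='mean')
df[['bedrooms', 'bathrooms']] = imputer.fit_transform(df[['bedrooms', 'bathrooms']])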
What we did just now was data wrangling, i.e., cleaning up the data: dropping unnecessary columns and dealing with missing values. It's now time to perform Exploratory Data Analysis and build our Machine Learning model.
Let us see the number of houses with unique floor values.
We use the method value_counts, and then the method to_frame() to convert the result to a dataframe.
x = df['floors'].value_counts().to_frame()
x
 | floors
---|---
1.0 | 10680 |
2.0 | 8241 |
1.5 | 1910 |
3.0 | 613 |
2.5 | 161 |
3.5 | 8 |
As you can see, there are 10680 houses with only 1 floor in our dataset, 8241 houses with 2 floors, and so on.
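If you prefer proportions over raw counts (just an optional variation), value_counts can normalize for you:

# Fraction of houses at each floor count, e.g. roughly 0.49 for single-floor houses
print(df['floors'].value_counts(normalize=True))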
Data Visualization is one of the essential steps in the data analysis workflow. A picture speaks a thousand words and always reinforces the concept; hence, building visualizations is a great habit during any data science task.
We use the function boxplot from the seaborn library to determine whether houses with or without a waterfront view have more price outliers.
plt.figure(figsize=(6,5))
sns.boxplot(x=df['waterfront'], y=df['price'])  # keyword args, as recent seaborn versions require
plt.title("Price vs Waterfront view")
plt.show()
What can you determine from the above graph? (The graph is a boxplot, so you may need to go through a basic tutorial or brainstorm a bit to read it; I'll leave that to you as an exercise.)
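If you'd like a numeric companion to that exercise, here's a sketch that counts outliers with the same 1.5 × IQR rule that boxplot whiskers conventionally use (my own addition, not part of the original notebook):

# Count price outliers (beyond 1.5*IQR) within each waterfront group
for wf, group in df.groupby('waterfront')['price']:
    q1, q3 = group.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = group[(group < q1 - 1.5 * iqr) | (group > q3 + 1.5 * iqr)]
    print(f"waterfront={wf}: {len(outliers)} outliers out of {len(group)} houses")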
Let's use the function regplot from the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.
plt.figure(figsize=(8,6))
sns.regplot(x=df['sqft_above'], y=df['price'])
plt.title("Correlation of Price vs Square Footage")
plt.show()
We can see a line with a positive slope. This means that sqft_above and price are positively correlated: the greater the square footage of the house apart from the basement, the higher its price.
We can use the Pandas method corr() to find the feature, other than price itself, that is most correlated with price.
df.corr(numeric_only=True)['price'].sort_values()  # numeric_only skips non-numeric columns such as date
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64
As you may know from the concept of correlation, the correlation coefficient always lies between -1 and 1. A negative coefficient signifies that as the value of the feature increases, the price decreases, and vice versa. A positive coefficient signifies that as the value of the feature increases, the price increases, and vice versa.
In the above output, we calculated the correlation coefficients of all features with the price.
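To see all pairwise correlations at once (an optional extra), seaborn's heatmap works nicely on the full correlation matrix:

# Heatmap of the correlation matrix; numeric_only again skips the text-valued date column
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')
plt.title("Feature correlation matrix")
plt.show()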
At this point, we have our dataset prepared and some basic analysis done. Buckle up, as we are now going to develop our machine learning model, which will predict the price of a new house given its features!
We will essentially split our dataset into two sets: a training set and a testing set. We will train the model using the training set and then predict the prices of the houses in the testing set. We will then measure how well the model performs using the R^2 score (the coefficient of determination). This score is at most 1 (and can even be negative for a very poor model); the closer it is to 1, the better our model fits.
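For the curious, R^2 compares the model's squared errors against a baseline that always predicts the mean; a hand-rolled version with made-up numbers (purely illustrative) looks like this:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up actual values
y_pred = np.array([2.8, 5.1, 7.3, 8.9])   # made-up predictions

ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
print(1 - ss_res / ss_tot)                      # close to 1 means a good fit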
We first fit a linear regression model to predict the 'price' using the following list of features:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
Then calculate the R^2.
lm = LinearRegression()
Z = df[features]
lm.fit(Z, df['price'])    # train the model on the features and prices
lm.score(Z, df['price'])  # R^2 on the same data it was trained on
0.657679183672129
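If you want to peek inside the fitted model (an optional inspection step), the learned equation's coefficients and intercept are exposed as attributes:

# One learned coefficient per feature: price ≈ intercept + sum(coef * feature)
for name, coef in zip(features, lm.coef_):
    print(f"{name}: {coef:.2f}")
print("intercept:", lm.intercept_)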
We are now at the final stage of our mini project, where we split our dataset into two parts, train the model on one and test it on the other. First, let's import the necessary modules.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
done
We will split the data into training and testing sets. We set test_size=0.15, which means that 15% of the data will be used for testing and the remaining 85% for training.
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
X = df[features]
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
print("Number of test samples:", x_test.shape[0])
print("Number of training samples:",x_train.shape[0])
Number of test samples: 3242
Number of training samples: 18371
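We imported cross_val_score earlier but haven't used it yet; although this walkthrough sticks to a single train/test split, here is a sketch of how you could use it on the same X and Y for a more robust estimate (averaging R^2 over, say, 4 folds):

# 4-fold cross-validation of a plain linear model; returns one R^2 per fold
scores = cross_val_score(LinearRegression(), X, Y, cv=4)
print(scores)
print("mean R^2:", scores.mean())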
Earlier, we used the Linear Regression model and got an R^2 score of 0.657679183672129.
Now, let's create and fit a first-order Ridge Regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. (Ridge regression is linear regression with an added penalty on the size of the coefficients, controlled by a regularization parameter alpha. You don't need to bother about the details; let's just see which model gives the better score, Linear Regression or Ridge Regression!)
from sklearn.linear_model import Ridge
RR = Ridge(alpha=0.1)
RR.fit(x_train, y_train) # this is where we are training the model
RR.score(x_test, y_test) # this is where we are passing the test set to calculate accuracy score
0.6478759163939122
We see that the first-order Ridge regression model gives a slightly lower score than the Linear Regression model. Let us now perform a second-order polynomial transform on both the training and testing data, create and fit a Ridge regression object using the transformed training data with the regularization parameter set to 0.1, and calculate the R^2 using the test data. A Pipeline chains these steps together neatly:
Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
         ('model', Ridge(alpha=0.1))]
pipe = Pipeline(Input)
pipe.fit(x_train, y_train)   # scaler, polynomial transform and Ridge are all fit on the training set only
pipe.score(x_test, y_test)   # R^2 on the held-out test set
So, which model gives the better score? Run it and compare for yourself: with the second-order terms in place, the polynomial Ridge pipeline can capture interactions between features that a straight line cannot, and on this data it should come out ahead of the plain Linear Regression model.
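Before we wrap up, one more knob worth knowing about: the regularization strength alpha. Our choice of 0.1 was arbitrary, so here's a quick sketch (the candidate values are picked arbitrarily too) that compares a few settings for the plain first-order Ridge model:

# Try a few regularization strengths and report the test-set R^2 for each
for alpha in [0.01, 0.1, 1, 10, 100]:
    model = Ridge(alpha=alpha)
    model.fit(x_train, y_train)
    print(f"alpha={alpha}: R^2 = {model.score(x_test, y_test):.4f}")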
We just built a machine learning model, trained it on some data, and measured how well it performed at predicting unseen data (the test set). For perspective, the world's largest Deep Learning model as of this writing, Microsoft's Turing Natural Language Generation (T-NLG), has 17 billion parameters, whereas our mini project worked with just 11 features!
Machine Learning and AI are automating things at a great pace and are expected to take over multiple jobs in the near future!