2. Data Preprocessing Using scikit-learn | California Housing Prices Dataset
Data preprocessing is a data mining technique that transforms raw data into an understandable format. There are many preprocessing methods, but our main focus here is on:
(1) Data Encoding
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
Dataset Description
Here I have used the ‘California Housing Prices dataset’. This dataset contains information such as longitude, latitude, ocean proximity, population, number of bedrooms, number of rooms, house price, etc.
This dataset contains numeric as well as categorical data. Its columns are on different scales, and it contains missing values, so it is a good dataset for demonstrating preprocessing.
Dataset: California Housing Prices dataset
Data Encoding
Data encoding is the transformation of categorical variables into binary or numerical counterparts: we assign a unique value to each category of a categorical attribute. An example is to treat male or female for gender as 1 or 0. There are two types of data encoding: (1) label encoding and (2) one-hot encoding.
(1) Label encoding
If a column in the dataset has more than one category, we can use a label encoder to convert those categories into numerical features. LabelEncoder assigns a unique number to each category.
The ‘median_house_value’ column has 3842 categories, which are simply house value ranges. After applying LabelEncoder, the data is labeled: the 500001 housing range is converted to 3841, the 137500 housing range to 959, the 162500 housing range to 1209, and so on.
The classes_ attribute lets us identify which original category each numeric label stands for (index 0: 14999 house range, index 1: 17500 house range, …).
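As a minimal sketch of the label-encoding step described above, the snippet below runs LabelEncoder on a handful of house values. The six sample values are a hypothetical stand-in for the real ‘median_house_value’ column, so the resulting labels differ from the ones quoted above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of 'median_house_value' entries (not the full dataset).
house_values = np.array([137500, 500001, 14999, 162500, 17500, 137500])

encoder = LabelEncoder()
labels = encoder.fit_transform(house_values)

# classes_ holds the sorted unique values; each label is an index into it.
print(encoder.classes_)
print(labels)

# inverse_transform maps the numeric labels back to the original values.
print(encoder.inverse_transform(labels))
```

Because classes_ is sorted, the smallest value always gets label 0 and the largest gets the highest label.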
(2) One-hot encoder
A one-hot encoder does the same job in a different way. Where the label encoder assigns a number to each category, the one-hot encoder assigns a whole new column to each category. So if a column has 3 categories, the one-hot encoder will add 3 new columns to your dataset.
Which one to use depends on the dataset and its behavior. One-hot encoding increases the dimensionality, but it is useful most of the time: with label encoding, a model may compare the numeric labels with each other as if they were ordered, and so make wrong assumptions. That is why one-hot encoding is used more in the real world. But I advise you to experiment with both.
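To make the "one column per category" idea concrete, here is a small sketch using OneHotEncoder. The four sample rows are a hypothetical stand-in for the ‘ocean_proximity’ column, not data read from the real file.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of 'ocean_proximity' values; the encoder expects a 2D input.
ocean_proximity = np.array([["NEAR BAY"], ["INLAND"], ["NEAR OCEAN"], ["INLAND"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; toarray() makes it dense for printing.
one_hot = encoder.fit_transform(ocean_proximity).toarray()

# categories_ lists the categories in the order of the generated columns.
print(encoder.categories_[0])
print(one_hot)
```

Three distinct categories give three output columns, and each row has exactly one 1 in the column of its category.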
Normalization
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. Real-world data is rarely available on a single scale; different columns usually have different scales. To bring all the columns onto one scale, we can use normalization methods.
MinMaxScaler: for each value in a feature, MinMaxScaler subtracts the feature's minimum value and then divides by the range, where the range is the difference between the feature's original maximum and original minimum.
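The rule above can be checked with a short sketch that compares MinMaxScaler against the same formula written by hand. The four input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column (one row per sample).
values = np.array([[10.0], [20.0], [30.0], [50.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

# Same rule by hand: (x - min) / (max - min).
manual = (values - values.min()) / (values.max() - values.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The minimum maps to 0, the maximum maps to 1, and everything else lands in between.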
Standardization
Standardization is another scaling technique, in which the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation (i.e. standard deviation = 1).
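A minimal sketch of standardization with StandardScaler, again on made-up values, shows the mean moving to 0 and the standard deviation to 1. The manual line assumes the population standard deviation, which is what NumPy's default std computes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column (one row per sample).
values = np.array([[2.0], [4.0], [6.0], [8.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(values)

# Same rule by hand: (x - mean) / standard deviation.
manual = (values - values.mean()) / values.std()

print(standardized.ravel())
print(standardized.mean(), standardized.std())
```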
Imputing Missing Values
Missing data are values that are not recorded in a dataset. They can be a single value missing from a single cell or an entire missing observation (row). Missing data can occur both in a continuous variable (e.g. height of students) and in a categorical variable (e.g. gender of a population).
We can handle missing values in two ways: (1) remove the rows that have missing values, or (2) fill in the values using some strategy, for example with an imputer.
Simple Imputer
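Both options can be sketched on a small column with one missing entry. The values below are hypothetical, and the median strategy is only one possible choice (SimpleImputer also accepts "mean", "most_frequent" and "constant").

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value (np.nan).
bedrooms = np.array([[100.0], [np.nan], [300.0], [200.0]])

# Option 1: remove every row that contains a missing value.
dropped = bedrooms[~np.isnan(bedrooms).any(axis=1)]

# Option 2: replace missing values with the median of the observed values.
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(bedrooms)

print(dropped.ravel())
print(filled.ravel())
```

Dropping rows loses one of the four samples, while imputing keeps all four and fills the gap with 200, the median of the three observed values.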
Discretization
Data discretization is the process of converting continuous data into discrete buckets by grouping it. By doing this we can limit the number of possible states. Basically, we convert numerical features into categorical columns.
There are 3 types of discretization available in scikit-learn: (1) Quantile Discretization Transform, (2) Uniform Discretization Transform, and (3) KMeans Discretization Transform.
Quantile Discretization Transform
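A minimal sketch of the quantile strategy, assuming scikit-learn's KBinsDiscretizer with strategy="quantile" and a made-up column that contains one large outlier. The choice of 3 bins and ordinal encoding is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical column with one large outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [100.0]])

# Quantile strategy: bin edges are chosen so each bin holds roughly
# the same number of samples.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values).ravel()

print(discretizer.bin_edges_[0])
print(bins)
```

Because the edges follow the data's quantiles, the outlier does not get a bin to itself; it simply falls into the last bin together with the other large values.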
Uniform Discretization Transform
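A minimal sketch of the uniform strategy, assuming KBinsDiscretizer with strategy="uniform" on made-up values between 0 and 30. The choice of 3 bins and ordinal encoding is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical column spanning the range 0 to 30.
values = np.array([[0.0], [4.0], [11.0], [14.0], [22.0], [27.0], [30.0]])

# Uniform strategy: every bin has the same width, (max - min) / n_bins.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = discretizer.fit_transform(values).ravel()

print(discretizer.bin_edges_[0])
print(bins)
```

Here the bin edges fall at 0, 10, 20 and 30, regardless of how many samples land in each bin.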
KMeans Discretization Transform
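A minimal sketch of the k-means strategy, assuming KBinsDiscretizer with strategy="kmeans" on made-up values that form three obvious groups. The choice of 3 bins and ordinal encoding is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical column with three well-separated groups of values.
values = np.array([[1.0], [2.0], [3.0], [50.0], [51.0], [52.0],
                   [100.0], [101.0], [102.0]])

# KMeans strategy: 1D k-means finds cluster centers, and the bin edges
# are placed halfway between neighbouring centers.
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
bins = discretizer.fit_transform(values).ravel()

print(discretizer.bin_edges_[0])
print(bins)
```

Each group of nearby values ends up in its own bin, so the bin boundaries follow the clusters in the data rather than fixed widths or fixed counts.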
Source Code: GitHub