    Hands-On Guide to Structured Data Classification

    Structured Data Classification - Hands On Solution

    Run the cell to import the packages

    
    import pandas as pd
    import numpy as np
    #import dataframe as df
    

    Data Loading

    Fill in the command to load your CSV dataset "weather.csv" with pandas

    
    weather = pd.read_csv('weather.csv', sep=',')
    

    Data Analysis

    • Get the shape of the dataset and print it.

    • Get the column names as a list and print them.

    • Describe the dataset to understand the basic statistics of the dataset.

    • Print the first three rows of the dataset

    
    data_size=weather.shape
    
    print(data_size)
    
    weather_col_names = list(weather.columns)
    
    print(weather_col_names)
    
    print(weather.describe())
    
    print(weather.head(3))
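
    Because a later step imputes missing values, it can help to count them during analysis. The sketch below uses a small invented frame in place of weather; the column Reading and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `weather`; the values are invented.
toy = pd.DataFrame({
    "Location": ["A", "B", None, "A"],
    "Reading": [1.5, np.nan, 3.0, np.nan],   # hypothetical numeric column
    "RainToday": ["No", "Yes", "No", None],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# describe() summarizes numeric columns by default; include="all" adds the rest.
print(toy.describe(include="all"))
```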
    

    Target Identification

    Execute the cell below to identify the target variable, RainTomorrow. A value of Yes means it will rain tomorrow; otherwise it will not rain.

    
    weather_target=weather['RainTomorrow'] 
    
    print(weather_target)
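
    Printing the whole target column is hard to read at a glance. As a possible complement, the sketch below counts each label on an invented series that stands in for weather['RainTomorrow'].

```python
import pandas as pd

# Invented labels standing in for weather['RainTomorrow'].
toy_target = pd.Series(["No", "No", "Yes", "No", "Yes", "No"], name="RainTomorrow")

# Absolute count of each label.
counts = toy_target.value_counts()
# Share of each label (fractions that sum to 1).
shares = toy_target.value_counts(normalize=True)

print(counts)
print(shares)
```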
    

    Feature Identification

    Analyzing the dataset suggests that a column like Date might be irrelevant, since it only records when the observation was made rather than a weather measurement.

    Since RainTomorrow is our target variable, we will be removing it from the feature set.

    • Perform the appropriate operation to drop the columns Date and RainTomorrow
    
    cols_to_drop = ['Date','RainTomorrow']
    
    weather_feature = weather.drop(cols_to_drop,axis = 1)
    
    print(weather_feature.head(5))
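
    The drop can be checked on a toy frame. This is a minimal sketch with invented values; it shows that drop returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Toy frame with invented values; only the column names come from the guide.
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "Location": ["A", "B"],
    "RainTomorrow": ["No", "Yes"],
})

cols_to_drop = ["Date", "RainTomorrow"]

# axis=1 drops columns; toy.drop(columns=cols_to_drop) is an equivalent spelling.
toy_feature = toy.drop(cols_to_drop, axis=1)

print(list(toy_feature.columns))
print(list(toy.columns))  # the original frame still has all three columns
```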
    

    Categorical Data

    To identify the categorical variables in the data, run the following command in the cell below:

    
    weather_categorical = weather.select_dtypes(include=[object])
    
    print(weather_categorical.head(15))
    

    Convert to boolean

    Assign the column name RainToday to the variable yes_no_cols, then run the cell below to convert the column to boolean and print the first 5 rows of weather_feature

    
    yes_no_cols = ["RainToday"]
    
    weather_feature[yes_no_cols] = weather_feature[yes_no_cols] == 'Yes'
    
    print(weather_feature.head(5))
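
    One detail of the comparison is worth seeing on a toy column: a missing value is not equal to 'Yes', so it becomes False. The sketch below uses invented values; the map-based alternative is an assumption for illustration, not the guide's method.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"RainToday": ["Yes", "No", np.nan, "Yes"]})

# Direct comparison, as in the guide: the missing entry turns into False.
as_bool = toy["RainToday"] == "Yes"

# Hypothetical alternative: map keeps the missing entry as NaN,
# so a later imputation step could still see it.
as_float = toy["RainToday"].map({"Yes": 1.0, "No": 0.0})

print(as_bool.tolist())
print(as_float.tolist())
```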
    

    One Hot Encoding

    Execute the below cells to perform One Hot Encoding

    
    weather_dumm=pd.get_dummies(weather_feature, columns=["Location","WindGustDir","WindDir9am","WindDir3pm"], prefix=["Location","WindGustDir","WindDir9am","WindDir3pm"])
    
    # np.float was removed in NumPy 1.24; the built-in float is equivalent here
    weather_matrix = weather_dumm.values.astype(float)
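
    To see what the encoding produces, here is a minimal sketch on a toy frame. The column Reading and all values are invented; only WindGustDir is a name from the guide.

```python
import pandas as pd

toy = pd.DataFrame({
    "WindGustDir": ["N", "S", "N", None],
    "Reading": [1.0, 2.0, 3.0, 4.0],   # hypothetical numeric column
})

# Each distinct category becomes its own indicator column named prefix_value.
toy_dumm = pd.get_dummies(toy, columns=["WindGustDir"], prefix=["WindGustDir"])
print(toy_dumm)

# Convert to a plain float matrix, as the guide does for weather_dumm.
toy_matrix = toy_dumm.values.astype(float)
print(toy_matrix)
```

    With the default settings, a row whose category is missing gets zeros in every indicator column.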
    

    Imputing-Missing Values

    Impute the missing values using the following parameters

    • missing_values=np.nan
    • strategy='mean'
    • fill_value=None
    • verbose=0 (older scikit-learn releases only; the parameter was deprecated in 1.1 and removed in 1.3)
    • copy=True
    
    from sklearn.impute import SimpleImputer
    
    # verbose is omitted because current scikit-learn releases no longer accept it
    imp=SimpleImputer(missing_values=np.nan,strategy='mean', fill_value=None,copy=True)
    
    weather_matrix=imp.fit_transform(weather_matrix)
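
    Mean imputation can be checked by hand on a small matrix. This sketch uses invented numbers: each NaN is replaced by the mean of the observed values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value per column (values are invented).
toy_matrix = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(toy_matrix)

print(imp.statistics_)  # per-column means learned during fit
print(filled)
```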
    

    Standardization

    Run the below cell to perform standardization

    
    from sklearn.preprocessing import StandardScaler
    
    #Standardize the data by removing the mean and scaling to unit variance
    
    scaler = StandardScaler()
    
    #Fit to data, then transform it.
    
    weather_matrix = scaler.fit_transform(weather_matrix)
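
    As a quick sanity check of what standardization does, the sketch below scales a toy matrix (invented values) and confirms that each column ends up with mean 0 and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (values are invented).
toy_matrix = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_matrix)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

print(col_means)
print(col_stds)
```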
    

    Train and Test Data

    Split the data for training and testing (90% train, 10% test)

    • Perform a train-test split on weather_matrix and weather_target with 90% as train data and 10% as test data, and set random_state to seed.
    
    from sklearn.model_selection import train_test_split
    
    seed=5000
    
    train_data,test_data, train_label, test_label = train_test_split(weather_matrix,weather_target,test_size=0.1,random_state = seed)
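
    The effect of test_size and random_state can be seen on toy arrays. This is a minimal sketch with invented data: 20 rows split 90/10 gives 18 training rows and 2 test rows, and repeating the split with the same seed gives the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 5000

# Toy features and labels (invented): 20 samples, 2 features.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array(["No"] * 15 + ["Yes"] * 5)

train_data, test_data, train_label, test_label = train_test_split(
    X, y, test_size=0.1, random_state=seed
)
print(train_data.shape, test_data.shape)

# Splitting again with the same seed reproduces the same partition.
train_again, test_again, _, _ = train_test_split(
    X, y, test_size=0.1, random_state=seed
)
print(np.array_equal(test_data, test_again))
```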
    

    SVM Classification

    • Initialize the SVM classifier with the following parameters

      • kernel = linear
      • C= 0.025
      • random_state=seed
    • Train the model with train_data and train_label

    • Now predict the output with test_data

    • Evaluate the classifier with score from test_data and test_label

    • Print the predicted score

    
    from sklearn.svm import SVC
    
    classifier = SVC(kernel="linear",C=0.025,random_state=seed )
    
    classifier = classifier.fit(train_data,train_label)
    
    weather_predicted_target=classifier.predict(test_data)
    
    score = classifier.score(test_data,test_label)
    
    print('SVM Classifier : ',score)
    
    with open('output.txt', 'w') as file:
    
        file.write(str(np.mean(score)))
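
    The relationship between predict and score can be sketched on synthetic data. The dataset below comes from make_classification and is only a stand-in for the weather data; the names X_train, X_test and so on are chosen for this illustration. For a classifier, score returns the mean accuracy, which matches the accuracy computed from the predictions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

seed = 5000

# Synthetic stand-in for weather_matrix / weather_target.
X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=seed
)

clf = SVC(kernel="linear", C=0.025, random_state=seed)
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)

# score() is the mean accuracy on the given data and labels.
score = clf.score(X_test, y_test)
manual = accuracy_score(y_test, predicted)

print("score():", score)
print("accuracy_score():", manual)
```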
    

    Random Forest Classifier

    • Train a Random Forest classifier on the dataset using the following parameters.

      • max_depth=5
      • n_estimators=10
      • max_features=10
      • random_state=seed
    • Train the model with train_data and train_label.

    • Now predict the output with test_data.

    • Evaluate the classifier with score from test_data and test_label.

    
    from sklearn.ensemble import RandomForestClassifier
    
    classifier = RandomForestClassifier(max_depth=5,n_estimators=10,max_features=10,random_state=seed)
    
    classifier = classifier.fit(train_data,train_label)
    
    weather_predicted_target=classifier.predict(test_data)
    
    score = classifier.score(test_data,test_label)
    
    print('Random Forest Classifier : ',score)
    
    with open('output1.txt', 'w') as file:
    
        file.write(str(np.mean(score)))
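
    To see what the parameters control, here is a minimal sketch on synthetic data from make_classification, used only as a stand-in for the weather data. It uses 12 features so that max_features=10 is a valid choice. n_estimators sets the number of trees, max_depth caps the depth of each tree, and feature_importances_ gives one weight per input column.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

seed = 5000

# Synthetic stand-in for weather_matrix / weather_target.
X, y = make_classification(n_samples=200, n_features=12, random_state=seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=seed
)

clf = RandomForestClassifier(
    max_depth=5, n_estimators=10, max_features=10, random_state=seed
)
clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)
importances = clf.feature_importances_
tree_depths = [tree.get_depth() for tree in clf.estimators_]

print("Random Forest score:", score)
print("Number of trees:", len(clf.estimators_))
print("Deepest tree:", max(tree_depths))
print("Feature importances:", importances)
```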