Example of training time reduction for a classifier
Technical Articles (decision tree, gradient boosting decision tree) May dd, 2022
Shoichiro Yokotani, Application Development Expert
AI Platform division
In this column, I will present an example of analysis using the Frovedis machine learning algorithm.
In machine learning algorithms, supervised learning can be categorized into two types: regression and classification. In this article, we will take the latter, classification, as an example and run a sample using the Frovedis learning algorithm. We will also compare the time required for learning between the Frovedis version and the scikit-learn version.
Classifiers are applied to datasets in which a discrete output y is predicted from many input variables. For example, the Credit Card Fraud Detection dataset used in this article has 29 features and a binary output y: Not Fraud or Fraud. We will use this dataset to make a two-class decision with a machine learning algorithm.
Typical machine learning algorithms for classification include logistic regression, linear support vector machines, classification trees, random forests (an ensemble method built from classification trees), and gradient boosting classification trees. In this column, we will focus on two-class classification using classification trees and gradient boosting classification trees.
Gradient boosting decision trees can be used for classification. The scikit-learn version of gradient boosting decision trees does not perform parallel processing, whereas the Frovedis version processes the creation of each decision tree in parallel, so it is expected to reduce training time compared to scikit-learn on very large datasets.
Supervised Learning: A Learning Algorithm for Two-Class Classification¶
Two-class classification using classification trees and gradient boosting classification trees (training time comparison between scikit-learn version and Frovedis version)¶
Dataset: Credit Card Fraud Detection https://www.kaggle.com/mlg-ulb/creditcardfraud¶
Loading the dataset¶
import numpy as np
import pandas as pd
df = pd.read_csv('../../data/classify/creditcard.csv')
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))
data_features = df.drop(['Time', 'Class'], axis=1).values
data_target = df['Class'].values
Not Fraud    284315
Fraud           492
Name: Class, dtype: int64
df.drop(['Time', 'Class'], axis=1).head()
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | ... | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 |
1 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | ... | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 |
2 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | ... | 0.524980 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 |
3 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | ... | -0.208038 | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 |
4 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | ... | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 |
5 rows × 29 columns
Partitioning the data set for training and inference¶
from sklearn.model_selection import train_test_split
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, train_size=0.70, test_size=0.30, random_state=1)
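As a rough illustration of what the 70/30 split does, the following pure-Python sketch shuffles row indices with a fixed seed and cuts them into two disjoint sets. It is not scikit-learn's implementation: the helper name `split_indices` is hypothetical, and rounding the test size up is an assumption, chosen because it reproduces the 85,443 test rows that appear in the confusion matrices later in this article.

```python
import math
import random

def split_indices(n_samples, test_size=0.30, seed=1):
    """Return (train_idx, test_idx): two disjoint, shuffled index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

# The dataset has 284315 + 492 = 284807 rows.
train_idx, test_idx = split_indices(284807)
print(len(train_idx), len(test_idx))
```

For 284,807 rows this yields 199,364 training rows and 85,443 test rows. The shuffled order itself will differ from the one scikit-learn produces with `random_state=1`.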
Learning and inference with the Frovedis version of Classification Trees¶
import os, time
from frovedis.exrpc.server import FrovedisServer
from frovedis.mllib.tree import DecisionTreeClassifier as frovDecisionTreeClassifier
FrovedisServer.initialize("mpirun -np 6 {}".format(os.environ['FROVEDIS_SERVER']))
fdtc = frovDecisionTreeClassifier(max_depth=8)
t1 = time.time()
fdtc.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 0.262 sec
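The same `time.time()` pattern is repeated for every model in this article. One way to factor it out is a small helper like the sketch below. `timed_fit` and `DummyModel` are hypothetical names, not part of Frovedis or scikit-learn, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time

def timed_fit(model, X, y):
    """Call model.fit(X, y) and return (model, elapsed_seconds)."""
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    return model, elapsed

class DummyModel:
    """Stand-in estimator so the sketch runs without any ML library."""
    def fit(self, X, y):
        self.n_samples_ = len(X)
        return self

model, elapsed = timed_fit(DummyModel(), [[0.0], [1.0]], [0, 1])
print("train time: {:.3f} sec".format(elapsed))
```

Any estimator with a `fit(X, y)` method, such as `fdtc`, could be passed in place of `DummyModel`.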
Displaying inference results¶
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score, recall_score
pred = fdtc.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# confusion_matrix rows are true classes and columns are predicted classes,
# with 0 = Not Fraud (negative) and 1 = Fraud (positive)
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: '+ str(np.round(100*float(tpos+tneg)/float(tpos+tneg + fpos + fneg),2))+'%')
print("Recall : {recall_score}".format(recall_score = recallScore))
print("F1 Score : {f1_score}".format(f1_score = f1Score))
FrovedisServer.shut_down()
confusion matrix:
[[85292    16]
 [   34   101]]
Accuracy: 99.94%
Recall : 0.75
F1 Score : 0.8
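To make the relationship between the confusion matrix and the reported scores explicit, the sketch below recomputes them by hand from the four counts printed above. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes, and label 1 = Fraud is the positive class). The variable names are illustrative.

```python
# Counts copied from the Frovedis classification-tree output above.
cmat = [[85292, 16],
        [34, 101]]
tn, fp = cmat[0]   # true class 0 (Not Fraud)
fn, tp = cmat[1]   # true class 1 (Fraud)
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)             # share of actual frauds that were caught
precision = tp / (tp + fp)          # share of fraud alarms that were correct
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "Not Fraud".
baseline_accuracy = (tn + fp) / total

print("Accuracy: {:.2f}%".format(100 * accuracy))
print("Recall  : {:.2f}".format(recall))
print("F1 Score: {:.2f}".format(f1))
print("Always-Not-Fraud accuracy: {:.2f}%".format(100 * baseline_accuracy))
```

Because only 135 of the 85,443 test rows are fraudulent, always predicting Not Fraud would already reach about 99.84% accuracy, which is why Recall and F1 Score are the more informative numbers here.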
Learning with scikit-learn version of classification tree¶
import os, time
from sklearn.tree import DecisionTreeClassifier as skDecisionTreeClassifier
sdtc = skDecisionTreeClassifier(max_depth=8)
t1 = time.time()
sdtc.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 7.809 sec
Displaying inference results¶
pred = sdtc.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# confusion_matrix rows are true classes and columns are predicted classes,
# with 0 = Not Fraud (negative) and 1 = Fraud (positive)
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: '+ str(np.round(100*float(tpos+tneg)/float(tpos+tneg + fpos + fneg),2))+'%')
print("Recall : {recall_score}".format(recall_score = recallScore))
print("F1 Score : {f1_score}".format(f1_score = f1Score))
confusion matrix:
[[85295    13]
 [   37    98]]
Accuracy: 99.94%
Recall : 0.73
F1 Score : 0.8
Training and inference with Frovedis version of gradient boosting classification tree¶
from frovedis.mllib.ensemble import GradientBoostingClassifier
FrovedisServer.initialize("mpirun -np 6 " + os.environ["FROVEDIS_SERVER"])
fgb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01)
t1 = time.time()
fgb.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 16.839 sec
Displaying inference results¶
pred = fgb.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# confusion_matrix rows are true classes and columns are predicted classes,
# with 0 = Not Fraud (negative) and 1 = Fraud (positive)
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: '+ str(np.round(100*float(tpos+tneg)/float(tpos+tneg + fpos + fneg),2))+'%')
print("Recall : {recall_score}".format(recall_score = recallScore))
print("F1 Score : {f1_score}".format(f1_score = f1Score))
FrovedisServer.shut_down()
confusion matrix:
[[85300     8]
 [   35   100]]
Accuracy: 99.95%
Recall : 0.74
F1 Score : 0.82
Learning and inference with scikit-learn's version of gradient boosting classification trees¶
from sklearn.ensemble import GradientBoostingClassifier
sgb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01)
t1 = time.time()
sgb.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 1398.770 sec
Displaying inference results¶
pred = sgb.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# confusion_matrix rows are true classes and columns are predicted classes,
# with 0 = Not Fraud (negative) and 1 = Fraud (positive)
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: '+ str(np.round(100*float(tpos+tneg)/float(tpos+tneg + fpos + fneg),2))+'%')
print("Recall : {recall_score}".format(recall_score = recallScore))
print("F1 Score : {f1_score}".format(f1_score = f1Score))
confusion matrix:
[[85297    11]
 [   37    98]]
Accuracy: 99.94%
Recall : 0.73
F1 Score : 0.8
Displaying inference results of the Frovedis version of gradient boosting classification trees using PCA dimensionality reduction¶
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_test)
X_scaled = scaler.transform(X_test)
from frovedis.mllib.ensemble import GradientBoostingClassifier
from frovedis.decomposition import PCA
FrovedisServer.initialize("mpirun -np 6 " + os.environ["FROVEDIS_SERVER"])
fgb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01)
fgb.fit(X_train, y_train)
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
data = {'PCA1': X_pca[:,0], 'PCA2': X_pca[:,1], 'Target': y_test, 'Test': fgb.predict(X_test)}
df = pd.DataFrame(data)
df.head()
| | PCA1 | PCA2 | Target | Test |
---|---|---|---|---|
0 | 0.587710 | -0.102335 | 0 | 0 |
1 | 0.510780 | -0.442828 | 0 | 0 |
2 | 0.466657 | -0.192914 | 0 | 0 |
3 | 0.354354 | 0.037564 | 0 | 0 |
4 | 0.471092 | -0.166892 | 0 | 0 |
import matplotlib.pyplot as plt
df_0 = df[(df.Target==0) & (df.Test==0)]
df_1 = df[(df.Target==1) & (df.Test==1)]
df_test_0 = df[(df.Target==1) & (df.Test==0)]
df_test_1 = df[(df.Target==0) & (df.Test==1)]
plt.figure(figsize=(10,7))
plt.scatter(df_0['PCA1'], df_0['PCA2'], color='r', s=2, label='Not Fraud')
plt.scatter(df_1['PCA1'], df_1['PCA2'], color='g', s=2, label='Fraud')
plt.scatter(df_test_0['PCA1'], df_test_0['PCA2'], color='b', marker='x', s=100, label='False: Fraud->Not Fraud')
plt.scatter(df_test_1['PCA1'], df_test_1['PCA2'], color='y', marker='x', s=100, label='False: Not Fraud->Fraud')
plt.xlabel("PC1")
plt.ylabel('PC2')
plt.legend()
plt.show()
FrovedisServer.shut_down()
In these samples using the Credit Card Fraud Detection dataset, we first trained classification trees with the Frovedis and scikit-learn versions, and then trained gradient boosting classification trees with both versions. Finally, the classification results were plotted after reducing the features to two dimensions with PCA. The table below compares the training times of the Frovedis and scikit-learn learning algorithms.
Learning algorithm | Frovedis (sec) | scikit-learn (sec) | Ratio (scikit-learn / Frovedis) |
---|---|---|---|
Classification Trees | 0.26 | 7.81 | x30.0 |
Gradient boosting | 16.84 | 1398.77 | x83.1 |
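The ratio in the last column is the scikit-learn training time divided by the Frovedis training time. A minimal sketch of that arithmetic, using the unrounded times printed earlier in this article (the dictionary layout is only for illustration):

```python
# (Frovedis seconds, scikit-learn seconds) as printed by the training cells
train_times = {
    "Classification Trees": (0.262, 7.809),
    "Gradient boosting": (16.839, 1398.770),
}

speedups = {}
for name, (frovedis_sec, sklearn_sec) in train_times.items():
    speedups[name] = sklearn_sec / frovedis_sec
    print("{:<22} x{:.1f}".format(name, speedups[name]))
```

With the unrounded times the classification-tree ratio comes out at about x29.8. The x30.0 in the table follows from dividing the rounded values 7.81 and 0.26.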
Optimizing hyperparameters and running cross-validation on large datasets requires training repeatedly with varying parameters, which is often time consuming. When data with new characteristics is added frequently, this time-consuming re-training has to be repeated as well.
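To give a feel for how quickly repeated training adds up, the sketch below counts the fits needed for a grid search with cross-validation and multiplies that count by the single-fit gradient boosting times measured in this article. The parameter grid and the 5-fold setting are hypothetical, and the estimate assumes every fit costs the same as the single measured fit, which will not hold exactly because the cost varies with the parameters and the fold size.

```python
from itertools import product

# Hypothetical search space: 3 x 3 = 9 parameter combinations
param_grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
}
n_folds = 5  # hypothetical 5-fold cross-validation

combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_folds

# Single-fit gradient boosting training times from this article (seconds)
fit_seconds = {"Frovedis": 16.839, "scikit-learn": 1398.770}

estimated_hours = {}
for name, sec in fit_seconds.items():
    estimated_hours[name] = n_fits * sec / 3600
    print("{:<12} {} fits, about {:.2f} hours".format(name, n_fits, estimated_hours[name]))
```

Under these assumptions the 45 fits would take roughly 17.5 hours at the scikit-learn timing and about 13 minutes at the Frovedis timing.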
By using Frovedis' parallelized algorithms on SX-Aurora TSUBASA, high-performance models can be trained frequently and quickly, reducing the cost of system development and maintenance.