Contents
Learning Sources
Books
http://www.feat.engineering/
Readings & Videos
Machine learning: https://youtu.be/jGwO_UgTS7I?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU https://www.youtube.com/@NeoWizard
scikit-learn: https://www.youtube.com/watch?v=eVxGhCRN-xA&pp=ygUQ7YyM7J207I2sIOqwleydmA%3D%3D https://youtu.be/eVxGhCRN-xA https://youtu.be/pqNCD_5r0IU https://www.youtube.com/watch?v=URTZ2jKCgBc&list=PLQVvvaa0QuDd0flgGphKCej-9jp-QdzZ3 https://www.youtube.com/@NeoWizard https://youtu.be/TNcfJHajqJY
ML 101
Feature engineerings
Feature engineering – the process of creating representations of data that increase the effectiveness of a model (from “Feature Engineering and Selection: A Practical Approach for Predictive Models,” Max Kuhn and Kjell Johnson, 2019).
When two features are highly correlated, problems may arise in some models (e.g., simple regressions).
Track outliers in the summary statistics (when the maximum/median or median/minimum ratios seem suspicious).
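A minimal pandas sketch of this ratio check (my own illustration; `df` is a hypothetical numeric DataFrame, not from the source):
import numpy as np
import pandas as pd
# Hypothetical numeric DataFrame; replace with your own data
df = pd.DataFrame(np.random.lognormal(size=(100, 3)), columns=['a', 'b', 'c'])
stats = df.describe().T  # count, mean, std, min, 25%, 50%, 75%, max per column
stats['max_over_median'] = stats['max'] / stats['50%']
stats['median_over_min'] = stats['50%'] / stats['min']
# Unusually large ratios hint at outliers or heavy tails
print(stats[['min', '50%', 'max', 'max_over_median', 'median_over_min']])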
Q: In machine learning, explain the concept of the conditional average in feature engineering.
In machine learning, the concept of “conditional average” in feature engineering refers to computing the average value of a target variable or another feature, conditioned on the values of one or more specific features. This technique can be used to create new features that capture the relationship between these features and the target variable.
Steps to Calculate Conditional Average
- Identify the Variables:
- Target Variable (Y): The variable for which the conditional average will be computed.
- Conditioning Features (X): The features on which the conditional average of Y will be conditioned.
- Group Data by Conditioning Features:
- Partition the dataset based on unique values or ranges of the conditioning features.
- Compute the Average:
- For each group formed in the previous step, compute the average value of the target variable (or the feature of interest).
- Assign the Conditional Average:
- Create a new feature in the dataset where each instance is assigned the conditional average computed for its group.
Example
Suppose you have a dataset with the following features:
- Feature X1: Age of a customer
- Feature X2: Gender of a customer
- Target Variable Y: Annual spending amount
You want to create a new feature that captures the average annual spending conditioned on the age and gender of the customers.
Step-by-Step Calculation
- Identify the Variables:
- Target Variable (Y): Annual spending amount.
- Conditioning Features (X): Age and Gender.
- Group Data by Conditioning Features:
- Partition the dataset by unique combinations of Age and Gender.
- Compute the Average:
- For each age-gender combination, compute the average annual spending amount.
- Assign the Conditional Average:
- Create a new feature, say “Cond_Avg_Spending,” and for each customer, assign the computed average spending amount based on their age and gender group.
Example Dataset
Age | Gender | Annual Spending | Cond_Avg_Spending |
---|---|---|---|
25 | Male | $2000 | $2200 |
30 | Female | $3000 | $2800 |
25 | Male | $2400 | $2200 |
30 | Female | $2600 | $2800 |
Here, “Cond_Avg_Spending” is the conditional average of annual spending based on the customer’s age and gender.
Applications
- Handling Categorical Variables: Creating conditional averages for categorical variables to capture their impact on the target variable.
- Reducing Noise: Smoothing noisy features by computing averages over groups.
- Capturing Relationships: Enhancing model performance by capturing complex relationships between features and the target variable.
Implementation in Python (Pandas Example)
import pandas as pd
# Sample data
data = {
'Age': [25, 30, 25, 30],
'Gender': ['Male', 'Female', 'Male', 'Female'],
'Annual_Spending': [2000, 3000, 2400, 2600]
}
df = pd.DataFrame(data)
# Compute conditional average
df['Cond_Avg_Spending'] = df.groupby(['Age', 'Gender'])['Annual_Spending'].transform('mean')
print(df)
This will output:
   Age  Gender  Annual_Spending  Cond_Avg_Spending
0   25    Male             2000             2200.0
1   30  Female             3000             2800.0
2   25    Male             2400             2200.0
3   30  Female             2600             2800.0
The new feature “Cond_Avg_Spending” captures the conditional average annual spending based on the age and gender of the customers.
Sklearn modules
Module | Description |
---|---|
sklearn.datasets | Built-in example datasets |
sklearn.preprocessing | Various data preprocessing utilities (transformation, normalization, scaling, etc.) |
sklearn.feature_selection | Tools for selecting features |
sklearn.feature_extraction | Used for feature extraction |
sklearn.decomposition | Dimensionality reduction algorithms (PCA, NMF, Truncated SVD, etc.) |
sklearn.model_selection | APIs for splitting data into train/test sets for cross-validation and for finding optimal parameters (GridSearch, etc.) |
sklearn.metrics | Various performance measures for classification, regression, clustering, and pairwise comparison (Accuracy, Precision, Recall, ROC-AUC, RMSE, etc.) |
sklearn.pipeline | Utilities for chaining feature-processing transformations with ML model training and prediction |
sklearn.linear_model | Regression algorithms such as linear regression, Ridge, Lasso, and logistic regression, plus SGD (Stochastic Gradient Descent) |
sklearn.svm | Support vector machine algorithms |
sklearn.neighbors | Nearest-neighbor algorithms (k-NN, etc.) |
sklearn.naive_bayes | Naive Bayes algorithms (Gaussian NB, multinomial NB, etc.) |
sklearn.tree | Decision tree algorithms |
sklearn.ensemble | Ensemble algorithms (Random Forest, AdaBoost, GradientBoost, etc.) |
sklearn.cluster | Unsupervised clustering algorithms (k-Means, hierarchical clustering, DBSCAN, etc.) |
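A minimal sketch (my own illustration, not from the source video) tying several of these modules together in a single Pipeline; the dataset and parameter grid are arbitrary choices:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Chain preprocessing and the estimator so both are fit (and tuned) together
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(accuracy_score(y_test, grid.predict(X_test)))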
https://www.youtube.com/watch?v=SmZmBKc7Lrs
#Loading the data
from sklearn import datasets
X, y = datasets.load_wine(return_X_y=True) # Classification
from sklearn import datasets
diabetes = datasets.load_diabetes() # Regression
X, y = diabetes.data, diabetes.target
# Split to training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
#Preprocessing
Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
Normalization
from sklearn.preprocessing import Normalizer
norm = Normalizer()
norm_X_train = norm.fit_transform(X_train)
norm_X_test = norm.transform(X_test)
Binarization
from sklearn.preprocessing import Binarizer
binary = Binarizer(threshold=0.0)
binary_X = binary.fit_transform(X)
Encoding Categorical Variables
Encode string-valued labels as integers (LabelEncoder is intended for the target y)
from sklearn.preprocessing import LabelEncoder
lab_enc = LabelEncoder()
y = lab_enc.fit_transform(y)
Imputer
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=0, strategy='mean')
imp_mean.fit_transform(X_train)
Generating Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
poly.fit_transform(X)
# FIT THE SUPERVISED LEARNING MODEL
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
lr = LinearRegression()
lr.fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
svm_svc = SVC(kernel='linear')
svm_svc.fit(X_train, y_train)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# PREDICT Supervised Estimators
y_pred = lr.predict(X_test)
y_pred = svm_svc.predict(X_test)
y_pred = knn.predict_proba(X_test) #Estimate probability of a label
# FIT THE UNSUPERVISED LEARNING MODEL
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans_model = kmeans.fit(X_train) # Fit the model to the data
pca = PCA(n_components=2)
pca_model = pca.fit_transform(X_train) #Fit to data, then transform it
# PREDICT Unsupervised Estimators
y_pred = kmeans_model.predict(X_test) #Predict labels in clustering algos
# FIT THE NN LEARNING MODEL
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
# EVALUATION
from sklearn import metrics
# Classification: Accuracy
lr.score(X_test, y_test)                     # Estimator score method
knn.score(X_test, y_test)                    # Estimator score method
from sklearn.metrics import accuracy_score   # Metric scoring function
accuracy_score(y_test, y_pred)
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred)) # Precision, recall, f1-score and support
# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
# Regression Metrics: Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
# Mean Absolute Error
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
mean_absolute_error(y_true, y_pred)
# R2 Score
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
# Clustering Metrics: Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)
# Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)
# V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)
# Cross-validation
from sklearn.model_selection import cross_val_score
cross_val_score(lr, X, y, cv=5, scoring='f1_macro')
print(cross_val_score(knn, X_train, y_train, cv=4))
print(cross_val_score(lr, X, y, cv=2))
# Model tuning: Grid Search
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
model = GridSearchCV(svm_svc, parameters)
model.fit(X_train, y_train)
print(model.best_score_)
print(model.best_estimator_)
# Grid search over a k-NN classifier (the parameter keys/values below are illustrative)
import numpy as np
params = {'n_neighbors': np.arange(1, 3), 'weights': ['uniform', 'distance']}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
# Randomized Parameter Optimization (the parameter keys/values below are illustrative)
from sklearn.model_selection import RandomizedSearchCV
params = {'n_neighbors': range(1, 5), 'weights': ['uniform', 'distance']}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)
# Perform variance thresholding on raw features
from sklearn.feature_selection import VarianceThreshold
# Replace each attribute's missing values with the median of that attribute
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
# Convert categories from text to numbers
from sklearn.preprocessing import OrdinalEncoder
data_encoded = OrdinalEncoder().fit_transform(data)
# OneHotEncoder converts categorical values into one-hot vectors
from sklearn.preprocessing import OneHotEncoder
data_encoded = OneHotEncoder().fit_transform(data)
Regression
# linear regression
# training
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)
print ('Intercept: ', regr.intercept_)
print ('Coefficients: ', regr.coef_)
# prediction
test_y_ = regr.predict(test_x)
# evaluation
import numpy as np
from sklearn.metrics import r2_score
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_y_) )
# Polynomial regression
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
# stochastic gradient descent
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
# Ridge
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
# Lasso
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
# Elastic net
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
# Logistic regression
from sklearn.linear_model import LogisticRegression
logi_reg = LogisticRegression()
logi_reg.fit(X, y)
logi_reg.predict([[1.7], [1.5]])
# Logistic regression with GridSearch
from sklearn.model_selection import GridSearchCV
lr = LogisticRegression()
parameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
cv = GridSearchCV(lr, parameters, cv=5)
cv.fit(features, labels)
# Softmax regression
softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10)
softmax_reg.fit(X, y)
softmax_reg.predict([[5, 2]])
softmax_reg.predict_proba([[5, 2]])
Penalized Regression
One can resort to penalization to improve the robustness of factor-based predictive regressions; the outcome can then be used to fuel an allocation scheme. For instance, Han et al. (2019) and Rapach and Zhou (2019) use penalized regressions to improve stock return prediction when combining forecasts that emanate from individual characteristics.
Similar ideas can be developed for macroeconomic predictions, for instance as in Uematsu and Tanaka (2019). The second application stems from a lesser-known result that originates from Stevens (1998). It links the weights of optimal mean-variance portfolios to particular cross-sectional regressions. The idea there is different: the purpose is to improve the quality of mean-variance-driven portfolio weights. We present the two approaches below after an introduction to regularization techniques for linear models.
Other examples of financial applications of penalization can be found in d’Aspremont (2011), Ban, El Karoui, and Lim (2016) and Kremer et al. (2019). In any case, the idea is the same as in the seminal paper Tibshirani (1996): standard (unconstrained) optimization programs may lead to noisy estimates, thus adding a structuring constraint helps remove some noise (at the cost of a possible bias). For instance, Kremer et al. (2019) use this concept to build more robust mean-variance (Markowitz (1952)) portfolios and Freyberger, Neuhierl, and Weber (2020) use it to single out the characteristics that help explain the cross-section of equity returns.
Source: “Machine Learning for Factor Investing” by Coqueret and Guida, 2023, CH 5, Chapman & Hall
Tibshirani (1996) is the inventor of LASSO: https://webdoc.agsci.colostate.edu/koontz/arec-econ535/papers/Tibshirani%20(JRSS-B%201996).pdf
“The ‘lasso’ minimizes residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant”
$latex \underset{\beta_0, \beta}{\text{minimize}} \left( \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)$
Lambda is the penalty term that determines the amount of shrinkage (or constraint) applied. With lambda set to zero, the objective is equivalent to ordinary linear regression, while larger values penalize the optimization more heavily. Lasso regression therefore shrinks the coefficients, which helps reduce model complexity and multicollinearity. Lambda can be any real value between zero and infinity; the larger the value, the more aggressive the penalization.
The hyperparameter “lambda” controls the weighting of the penalty in the loss function. A default value of 1.0 gives full weight to the penalty; a value of 0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.
One approach to addressing the stability of regression models is to change the loss function to include additional costs for a model with large coefficients; linear regression models that use such modified loss functions during training are collectively referred to as penalized linear regression. The penalty shrinks the coefficients of input variables that do not contribute much to the prediction task and allows some coefficients to reach exactly zero, effectively removing those inputs from the model and providing a form of automatic feature selection.
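A minimal sketch of this shrinkage effect (my own illustration on synthetic data; in scikit-learn the alpha parameter plays the role of lambda):
import numpy as np
from sklearn.linear_model import Lasso
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.randn(200)
# As alpha (lambda) grows, coefficients shrink and more of them hit exactly zero
for alpha in [0.001, 0.1, 1.0]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: {np.round(coefs, 2)}")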
https://en.wikipedia.org/wiki/Lasso_(statistics)
https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression
Regularization techniques penalize the magnitude of feature coefficients while minimizing the error between predicted and actual observations.
LASSO is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The LASSO technique uses L1 regularization to penalize the absolute size of the coefficients. This has the effect of setting some coefficients to zero, effectively selecting a simpler model that retains only the most significant features.
Key Points:
- Regularization: LASSO adds a penalty equal to the absolute value of the magnitude of coefficients.
- Feature Selection: By shrinking some coefficients to zero, LASSO can select a subset of features, which is useful for models with many predictors.
- Bias-Variance Tradeoff: Regularization reduces variance but increases bias, which can lead to better model generalization.
LASSO (Least Absolute Shrinkage and Selection Operator) and OLS (Ordinary Least Squares) are both linear regression methods, but they differ significantly in how they handle model complexity and feature selection.
OLS (Ordinary Least Squares)
- Objective: minimize the sum of squared residuals (the difference between observed and predicted values).
- Penalty: none. As a result, OLS can lead to overfitting, especially when dealing with many predictors or highly correlated predictors.
- Feature Selection: none. All predictors are included in the final model, regardless of their significance.
LASSO
- Objective: minimize the sum of squared residuals with an added penalty proportional to the sum of the absolute values of the coefficients.
- Penalty: an L1 penalty, which shrinks some coefficients to zero. This can help prevent overfitting and improve the model’s ability to generalize to new data.
- Feature Selection: LASSO performs feature selection by setting some coefficients exactly to zero, effectively selecting a simpler model that includes only the most relevant predictors.
By introducing regularization, LASSO balances the complexity of the model and improves its performance on unseen data, addressing the overfitting issue inherent in OLS.
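A small sketch (my own, on synthetic data) contrasting the two: OLS keeps every coefficient non-zero, while LASSO zeroes out the uninformative ones:
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
rng = np.random.RandomState(1)
X = rng.randn(150, 8)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.randn(150)  # only two informative features
ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("OLS zero coefficients:  ", np.sum(ols.coef_ == 0))    # typically 0
print("LASSO zero coefficients:", np.sum(lasso.coef_ == 0))  # several exact zeros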
# from Chat GPT
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
# Sample financial dataset
np.random.seed(0)
n_samples, n_features = 100, 10
X = np.random.randn(n_samples, n_features)
true_coeffs = np.array([3.5, -2, 0, 4.5, 0, 0, 1.2, 0, 0, -1])
y = X.dot(true_coeffs) + np.random.normal(size=n_samples)
# Convert to DataFrame for better visualization (optional)
df = pd.DataFrame(X, columns=[f'Feature_{i+1}' for i in range(n_features)])
df['Target'] = y
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Fit LASSO model
lasso = Lasso(alpha=0.1)  # alpha is the regularization strength
lasso.fit(X_train, y_train)
# Predict and evaluate
y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)
# Print coefficients
print("Coefficients:", lasso.coef_)
# Evaluate the model
print("Training MSE:", mean_squared_error(y_train, y_pred_train))
print("Testing MSE:", mean_squared_error(y_test, y_pred_test))
print("Training R2:", r2_score(y_train, y_pred_train))
print("Testing R2:", r2_score(y_test, y_pred_test))
output:
Coefficients: [ 3.49900737 -1.88459788 -0.04363231 4.91300606 -0.07362807 -0. 1.06434658 0. 0.05968488 -0.89372899]
Training MSE: 0.6769542836172144
Testing MSE: 0.873434699350053
Training R2: 0.9854216938258963
Testing R2: 0.9835558543877939
- Ridge Regression:
- Performs L2 regularization, i.e., adds penalty equivalent to the square of the magnitude of coefficients
- Minimization objective = LS Obj + α * (sum of square of coefficients)
- Lasso Regression:
- Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of the magnitude of coefficients
- Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)
Ridge Regression
Similar to the lasso regression, ridge regression puts a similar constraint on the coefficients by introducing a penalty factor. However, while lasso regression takes the magnitude of the coefficients, ridge regression takes the square.
$latex \hat{\beta}^{ridge} = \arg \min_{\beta} \left( \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$
Why Lasso can be Used for Model Selection, but not Ridge Regression
Source: https://online.stat.psu.edu/stat508/book/export/html/749
Considering the geometry of both the lasso (left) and ridge (right) models, the elliptical contours (red circles) are the cost functions for each. Relaxing the constraints introduced by the penalty factor leads to an increase in the constrained region (diamond, circle). Doing this continually, we will hit the center of the ellipse, where the results of both lasso and ridge models are similar to a linear regression model.
However, both methods determine coefficients by finding the first point where the elliptical contours hit the region of constraints. Since lasso regression takes a diamond shape in the plot for the constrained region, each time the elliptical regions intersect with these corners, at least one of the coefficients becomes zero. This is impossible in the ridge regression model as it forms a circular shape and therefore values can be shrunk close to zero, but never equal to zero.
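A small sketch (my own, on synthetic data) of this difference: with comparable penalty strengths, Ridge only shrinks coefficients toward zero while Lasso sets some of them exactly to zero:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
rng = np.random.RandomState(2)
X = rng.randn(100, 6)
y = X[:, 0] - 2.0 * X[:, 1] + rng.randn(100)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)
print("Ridge:", np.round(ridge.coef_, 3))  # small but non-zero values
print("Lasso:", np.round(lasso.coef_, 3))  # several exact zeros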
Source: https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression
https://www.datacamp.com/datalab/w/e09a2743-6536-4d44-88e9-088be0edc682
SVC
# SVC
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svc = SVC()
parameters = {'kernel': ['linear', 'rbf'],'C': [0.1, 1, 10]}
cv = GridSearchCV(svc, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
KNN
# KNN
# preprocessing
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
# training
from sklearn.neighbors import KNeighborsClassifier
k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
# prediction
yhat = neigh.predict(X_test)
# evaluation
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
# KNN example
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)
output:
0.631578947368421
Feed forward artificial neural network
# multi-layer perceptron - classical feed forward artificial neural network
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
parameters = {'hidden_layer_sizes': [(10,), (50,), (100,)],
'activation': ['relu', 'tanh', 'logistic'],
'learning_rate': ['constant', 'invscaling', 'adaptive']
}
cv = GridSearchCV(mlp, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
cv.best_estimator_
Random forest
“Random forests are applicable to a wide range of problems—you could say that they’re almost always the second-best algorithm for any shallow machine learning task. When the popular machine learning competition website Kaggle (http://kaggle.com) got started in 2010, random forests quickly became a favorite on the platform—until 2014, when gradient boosting machines took over. A gradient boosting machine, much like a random forest, is a machine learning technique based on ensembling weak prediction models, generally decision trees. It uses gradient boosting, a way to improve any machine learning model by iteratively training new models that specialize in addressing the weak points of the previous models. Applied to decision trees, the use of the gradient boosting technique results in models that strictly outperform random forests most of the time, while having similar properties. It may be one of the best, if not the best, algorithm for dealing with nonperceptual data today. Alongside deep learning, it’s one of the most commonly used techniques in Kaggle competitions.” (from Chollet, 2017, Ch1)
# Random Forest - a collection of independent decision trees whose predictions are combined to improve accuracy over a single tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier()
parameters = {
'n_estimators': [5, 50, 250],
'max_depth': [2, 4, 8, 16, 32, None]
}
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
Ensemble
# Ensemble - Boosting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
gb = GradientBoostingClassifier()
parameters = {
'n_estimators': [5, 50, 250, 500],
'max_depth': [1, 3, 5, 7, 9],
'learning_rate': [0.01, 0.1, 1, 10, 100]
}
cv = GridSearchCV(gb, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())
Decision tree
# decision tree
Versatile algorithm: it can perform both classification and regression tasks, is the fundamental component of Random Forests, and doesn't require feature scaling or centering at all.
from sklearn.tree import DecisionTreeRegressor
# Split our data into a training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)
# training
model = DecisionTreeRegressor(criterion='squared_error')  # 'mse' in scikit-learn versions before 1.0
model.fit(X_train, Y_train)
# evaluation / prediction
model.score(X_test, Y_test)
prediction = model.predict(X_test)
print("$",(prediction - Y_test).abs().mean()*1000)
# visualize the fitted tree (requires graphviz; `features` holds the feature names)
from sklearn import tree
from IPython.display import SVG
from graphviz import Source
graph = Source(tree.export_graphviz(model, out_file=None, feature_names=features))
SVG(graph.pipe(format='svg'))
K means
# K means - clustering
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers[['Income', 'SpendingScore']])
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 3, n_init = 25, random_state = 1234)
km.fit(customers_scaled)
km.labels_
km.inertia_
km.cluster_centers_
# determining k: compute the metrics for a range of cluster counts
from sklearn.metrics import silhouette_score, calinski_harabasz_score
wcss, silhouette, calinski = [], [], []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=25, random_state=1234).fit(customers_scaled)
    wcss.append(km.inertia_)
    silhouette.append(silhouette_score(customers_scaled, km.labels_))
    calinski.append(calinski_harabasz_score(customers_scaled, km.labels_))
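A short follow-up sketch (assuming matplotlib is available) to inspect the elbow and silhouette curves produced above:
import matplotlib.pyplot as plt
ks = range(2, 11)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ks, wcss, marker='o'); ax1.set_title('Elbow (WCSS)'); ax1.set_xlabel('k')
ax2.plot(ks, silhouette, marker='o'); ax2.set_title('Silhouette score'); ax2.set_xlabel('k')
plt.show()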