Deep Learning 1

Learning Sources

deep learning - in general  

https://github.com/fchollet/deep-learning-with-python-notebooks
https://github.com/ageron/handson-ml2
Dive into Deep Learning — https://d2l.ai/
https://youtu.be/DooxDIRAkPA
https://youtu.be/dafuAz_CV7Q
https://youtu.be/VyWAvY2CF9c
https://youtu.be/WHvWSYKGMDQ
https://www.learnpytorch.io/
https://dev.mrdbourke.com/tensorflow-deep-learning/
https://github.com/mrdbourke/tensorflow-deep-learning/
https://github.com/mrdbourke/pytorch-deep-learning
https://www.youtube.com/channel/UCbfYPyITQ-7l4upoX8nvctg
https://www.youtube.com/c/3blue1brown
https://www.youtube.com/@lexfridman
Neural Networks and Deep Learning Book by Michael Nielsen

For machine learning and deep learning history & new developments, watch this MIT lecture.
By the way, Lex Fridman recommends this site for NLP:
https://github.com/sebastianruder/NLP-progress





keras

https://realpython.com/python-keras-text-classification/ 
https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
https://keras.io/examples/
https://keras.io/api/applications/



tensorflow keras

https://www.youtube.com/watch?v=B961QM47g64
https://youtu.be/28QbrkRkHlo
https://www.youtube.com/watch?v=Y__gyApx_7c&t=5231s
https://youtu.be/tPYj3fFJGjk
https://youtu.be/VtRLrQ3Ev-U
https://www.youtube.com/watch?v=tpCFfeUEGs8&t=10767s
https://www.youtube.com/watch?v=ZUKz4125WNI




PyTorch

https://www.learnpytorch.io/
https://github.com/mrdbourke/pytorch-deep-learning
https://pytorch.org/tutorials/beginner/basics/intro.html
https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html




Sequence model
https://youtu.be/L8HKweZIOmg

RNN
https://youtu.be/6niqTuYFZLQ
https://youtu.be/ySEx_Bqxvvo
https://youtu.be/S7oA5C43Rbc
https://www.ibm.com/topics/recurrent-neural-networks
https://aws.amazon.com/what-is/recurrent-neural-network/
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks

LSTM
https://youtu.be/YCzL96nL7j0

transformer

https://huggingface.co/docs/transformers/en/index
https://youtu.be/XfpMkf4rD6E
https://youtu.be/eMlx5fFNoYc
https://blogs.nvidia.com/blog/what-is-a-transformer-model/
https://aws.amazon.com/what-is/transformers-in-artificial-intelligence/
https://www.datacamp.com/tutorial/how-transformers-work

https://www.youtube.com/watch?v=LWMzyfvuehA
https://www.youtube.com/watch?v=XowwKOAWYoQ
https://www.youtube.com/watch?v=bCz4OMemCcA



NLP using deep learning

https://www.youtube.com/watch?v=Hn3GHHOXKCE
https://www.youtube.com/watch?v=Rf7wvs8ZbP4&t=1853s

Warm-ups

Neural networks: use examples to automatically infer rules for recognizing patterns.

Perceptrons ( https://images.app.goo.gl/DBBjf95jCtdB4ckG7 ) were developed in the 1950s by Frank Rosenblatt. The neuron’s output, 0 or 1, is determined by whether the weighted sum of its inputs is less than or greater than some threshold value.
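
A minimal sketch of that decision rule in Python; the weights and bias here are made-up values purely for illustration (the bias is the negated threshold):

import numpy as np

def perceptron(x, w, b):
    # Output 1 if the weighted sum plus bias exceeds 0, else 0
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.6, 0.4])   # hypothetical weights
b = -0.5                   # hypothetical bias (threshold of 0.5)
print(perceptron(np.array([1, 0]), w, b))  # 0.6 - 0.5 = 0.1 > 0  -> outputs 1
print(perceptron(np.array([0, 1]), w, b))  # 0.4 - 0.5 = -0.1 <= 0 -> outputs 0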

The best way to improve a deep learning model is to train it on more data or better data.

Previous machine learning techniques—shallow learning—only involved transforming the input data into one or two successive representation spaces, usually via simple transformations such as high-dimensional non-linear projections (SVMs) or decision trees. But the refined representations required by complex problems generally can’t be attained by such techniques. As such, humans had to go to great lengths to make the initial input data more amenable to processing by these methods: they had to manually engineer good layers of representations for their data. This is called feature engineering. Deep learning, on the other hand, completely automates this step: with deep learning, you learn all features in one pass rather than having to engineer them yourself. This has greatly simplified machine learning workflows, often replacing sophisticated multistage pipelines with a single, simple, end-to-end deep learning model.(From Chollet, 2017, Ch1)

What happened is that the gaming market subsidized supercomputing for the next generation of artificial intelligence applications. Sometimes, big things begin as games. Today, the NVIDIA Titan RTX, a GPU that cost $2,500 at the end of 2019, can deliver a peak of 16 teraFLOPS in single precision (16 trillion float32 operations per second). That’s about 500 times more computing power than the world’s fastest supercomputer from 1990, the Intel Touchstone Delta. On a Titan RTX, it takes only a few hours to train an ImageNet model of the sort that would have won the ILSVRC competition around 2012 or 2013. Meanwhile, large companies train deep learning models on clusters of hundreds of GPUs. (From Chollet, 2017, Ch1)

The most popular NVIDIA GPUs for deep learning as of 2024 are the NVIDIA GeForce RTX 3090 and the RTX 4090, depending on the specific requirements and budget considerations.

As of 2024, the prices for the NVIDIA GPUs popular for deep learning vary significantly: the NVIDIA GeForce RTX 3090 is priced around $1,114, but you can find it for slightly less depending on sales and availability (Tom’s Hardware). The NVIDIA RTX 4090 is generally more expensive, costing between $1,650 and $2,178 (Tom’s Hardware).

Scalability—Deep learning is highly amenable to parallelization on GPUs or TPUs, so it can take full advantage of Moore’s law. In addition, deep learning models are trained by iterating over small batches of data, allowing them to be trained on datasets of arbitrary size. (The only bottleneck is the amount of parallel computational power available, which, thanks to Moore’s law, is a fast-moving barrier.) (From Chollet, 2017, Ch1)

Picking the right network architecture is more an art than a science; and although there are some best practices and principles you can rely on, only practice can help you become a proper neural-network architect. The three most common use cases of neural networks are binary classification, multiclass classification, and scalar regression.

At its core, a tensor is a container for data—almost always numerical data. So, it’s a container for numbers. You may be already familiar with matrices, which are 2D tensors: tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a dimension is often called an axis). (From Chollet, 2017, Ch2)

The test-set accuracy turns out to be 97.8%—that’s quite a bit lower than the training set accuracy. This gap between training accuracy and test accuracy is an example of overfitting: the fact that machine-learning models tend to perform worse on new data than on their training data. (From Chollet, 2017, Ch2)

Common data tensors (From Chollet, 2017, Ch2):
- Vector data — 2D tensors of shape (samples, features)
- Timeseries or sequence data — 3D tensors of shape (samples, timesteps, features)
- Images — 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
- Video — 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)

Core ingredients of a model (From Chollet, 2021, Ch3):
- Tensors, including special tensors that store the network’s state (variables)
- Tensor operations such as addition, relu, matmul
- Backpropagation, a way to compute the gradient of mathematical expressions (handled in TensorFlow via the GradientTape object)
- Layers, which are combined into a model
- A loss function, which defines the feedback signal used for learning
- An optimizer, which determines how learning proceeds
- Metrics to evaluate model performance, such as accuracy
- A training loop that performs mini-batch stochastic gradient descent

Why use non-linear activation functions?
Without non-linear activation functions, the entire neural network would behave like a 
single-layer perceptron, regardless of its depth. This is because a composition of 
linear  functions is still a linear function. Non-linear activation functions allow 
neural networks to learn complex, non-linear relationships between inputs and outputs, 
which is crucial for modeling real-world data that is inherently non-linear.
Affine transformation is a linear mapping method that preserves points, straight lines, 
and planes.
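
A quick numeric sketch of this point: two stacked affine layers with no activation collapse into a single affine map, so depth adds nothing. The matrices below are arbitrary illustrative values:

import numpy as np

W1, b1 = np.array([[1., 2.], [3., 4.]]), np.array([1., 1.])
W2, b2 = np.array([[0.5, -1.], [2., 0.]]), np.array([0., 2.])
x = np.array([1., -1.])

# Two stacked affine layers with no non-linearity in between
h = W1 @ x + b1
y_two_layers = W2 @ h + b2

# The same mapping expressed as a single affine layer
W = W2 @ W1
b = W2 @ b1 + b2
y_one_layer = W @ x + b

print(np.allclose(y_two_layers, y_one_layer))  # True: the two-layer stack is just one affine map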

Why do we need an optimizer in deep learning?
In deep learning, an optimizer is crucial because it determines how the model learns 
and improves from the data it is trained on. It determines how the network will be updated 
based on the loss function, typically by implementing a specific variant of stochastic 
gradient descent (SGD).
Different optimizers converge at different speeds for different kinds of data and models. 
For example, some optimizers like Adam are known for being fast to converge in many 
scenarios compared to traditional methods like stochastic gradient descent (SGD).

Minimizing Loss Function: The primary role of an optimizer is to minimize the loss 
function, which measures the difference between the predicted outputs of the model and 
the actual values. By minimizing the loss, the optimizer helps the model to improve its 
accuracy and performance.
Without an optimizer, the training process might not converge, meaning the model might 
not reach a point where the loss is minimized adequately. Optimizers guide the 
learning process to ensure that it converges to an optimal solution within a reasonable 
time frame.
What is the difference between an optimizer and an objective function?

Optimizer and Objective Function are two related but distinct concepts in Machine
Learning and Optimization:

Objective Function (also called Loss Function or Cost Function):
- Defines the goal or objective of the optimization problem
- Measures the difference between the model's predictions and the actual true labels
- Examples: Mean Squared Error (MSE), Cross-Entropy, Mean Absolute Error (MAE)

Optimizer:
- An algorithm that searches for the optimal parameters of a model
- Adjusts the model's parameters to minimize the Objective Function
- Examples: Stochastic Gradient Descent (SGD), Adam, RMSprop, Gradient Descent (GD)

In summary:
- The Objective Function defines what we want to optimize (e.g., minimize the error)
- The Optimizer is the algorithm that performs the optimization (e.g., adjusts the model's parameters to minimize the error)

For instance, you’ll use binary cross-entropy for a two-class classification problem,
categorical cross-entropy for a many-class classification problem, mean squared error
for a regression problem, connectionist temporal classification (CTC) for a
sequence-learning problem, and so on. Only when you’re working on truly new
research problems will you have to develop your own objective functions.
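
A brief Keras sketch of pairing an optimizer with the objective (loss) function appropriate to the task; the layer sizes and the input shape of 20 features are placeholder values:

from tensorflow import keras
from tensorflow.keras import layers

# Binary classification: sigmoid output + binary cross-entropy
binary_model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(20,)),
    layers.Dense(1, activation='sigmoid')
])
binary_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Multiclass classification: softmax output + categorical cross-entropy
multi_model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(20,)),
    layers.Dense(5, activation='softmax')
])
multi_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Regression: linear output + mean squared error
reg_model = keras.Sequential([
    layers.Dense(16, activation='relu', input_shape=(20,)),
    layers.Dense(1)
])
reg_model.compile(optimizer='sgd', loss='mse', metrics=['mae'])
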
What is learning rate decay?
In deep learning, learning rate decay refers to the technique of reducing the learning 
rate over time during training. The learning rate is a crucial hyperparameter that 
determines the size of the steps the optimizer takes towards the minimum of the loss 
function. By adjusting the learning rate throughout the training process, learning rate 
decay aims to achieve more effective and reliable training results.
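
A small sketch of learning rate decay in Keras using an exponential schedule; the initial rate, decay steps, and decay rate below are illustrative values, not recommendations:

from tensorflow import keras

# Decay the learning rate exponentially as training progresses
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,   # the rate is scaled by decay_rate every 1,000 optimizer steps
    decay_rate=0.9)

optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
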
Choosing the right activation function, such as sigmoid, softmax, or others, depends 
on the specific requirements of your neural network's architecture and the nature 
of the problem you're trying to solve. Here's a breakdown of when to use each:

Sigmoid Activation Function
- Range: Outputs values between 0 and 1.
- Use Cases: 
      - Binary Classification: Commonly used in the output layer when the task is binary 
        classification. It gives a probability-like output.
      - Hidden Layers: Occasionally used in hidden layers, but less common now due to 
        issues like vanishing gradients.
- Pros: 
      - Smooth gradient.
      - Outputs can be interpreted as probabilities.
- Cons: 
      - Vanishing gradient problem, especially for deep networks.
      - Outputs not zero-centered, which can slow down convergence.

Softmax Activation Function
- Range: Outputs a probability distribution over multiple classes, where the sum 
  of the probabilities is 1.
- Use Cases:
      - Multiclass Classification: Used in the output layer for multiclass 
        classification problems.
      - Output Layers: Effective in the final layer where a probabilistic 
        interpretation is needed across multiple categories.
- Pros: 
      - Provides a clear probabilistic interpretation.
      - Suitable for mutually exclusive class outputs.
- Cons: 
      - Computationally more expensive than simpler functions.
      - Can be less interpretable in the presence of non-mutually exclusive classes.

Other Activation Functions
- ReLU (Rectified Linear Unit):
- Range: Outputs values from 0 to infinity.
- Use Cases: 
      - Very popular for hidden layers in deep learning models due to its simplicity 
        and effectiveness.
- Pros:
      - Reduces the likelihood of the vanishing gradient problem.
      - Computationally efficient.
- Cons: 
      - Can cause dead neurons (outputting zero for all inputs).

- Tanh (Hyperbolic Tangent):
- Range: Outputs values between -1 and 1.
- Use Cases: 
      - Sometimes used in hidden layers where zero-centered outputs are desired.
- Pros: 
      - Zero-centered, which can help with convergence.
      - Strong gradients for inputs in the range [-1, 1].
- Cons: 
      - Still susceptible to the vanishing gradient problem.

- Leaky ReLU:
- Range: Outputs values from -infinity to infinity.
- Use Cases: 
      - Similar to ReLU but addresses the dead neuron problem by allowing a small, 
        non-zero gradient when the unit is not active.
- Pros: 
      - Helps mitigate the dead neuron issue.
- Cons: 
      - Requires tuning of the negative slope parameter.

Choosing the Right Activation Function
1. Task Nature:
- For binary classification: Sigmoid.
- For multiclass classification: Softmax.
- For hidden layers: ReLU or Leaky ReLU are generally good starting points.

2. Network Depth:
- For deep networks, prefer activation functions that mitigate vanishing gradients 
  like ReLU and its variants.

3. Output Interpretation:
- For probabilistic outputs: Sigmoid or Softmax.

4. Experimentation:
- Sometimes the best choice can be task-specific, requiring empirical testing to 
  see what works best for your specific dataset and model architecture.

By considering these factors, you can make an informed decision on which activation 
function to use in your neural network models.
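
A quick sketch comparing these activation functions on the same sample values; the input values are arbitrary and chosen only to show the output ranges discussed above:

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print("sigmoid:   ", tf.math.sigmoid(x).numpy())               # squashed into (0, 1)
print("tanh:      ", tf.math.tanh(x).numpy())                  # squashed into (-1, 1)
print("relu:      ", tf.nn.relu(x).numpy())                    # negatives clipped to 0
print("leaky relu:", tf.nn.leaky_relu(x, alpha=0.1).numpy())   # small non-zero slope for negatives
print("softmax:   ", tf.nn.softmax(x).numpy())                 # probabilities summing to 1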

Regularization in deep learning is a technique used to prevent overfitting, 
which occurs when a model learns the training data too well and performs poorly 
on new, unseen data. Regularization helps the model generalize better by adding 
a penalty to the loss function, discouraging the model from becoming too complex.

Here are a few common regularization techniques:

1. L1 and L2 Regularization: These add a penalty term to the loss function 
based on the size of the weights.
- L1 regularization (Lasso) adds the absolute value of the weights to the 
loss function.
- L2 regularization (Ridge) adds the squared value of the weights to the 
loss function.

2. Dropout: Randomly sets a fraction of the input units to 0 at each update 
during training, which helps prevent the model from becoming too reliant on 
specific neurons.

3. Early Stopping: Monitors the model's performance on a validation set and 
stops training when performance no longer improves, preventing the model from 
overfitting the training data.

4. Data Augmentation: Increases the diversity of the training data by applying 
random transformations like rotation, scaling, and flipping.


Example of L2 Regularization in Keras

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
from keras.datasets import mnist
from keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define the model
model = Sequential([
    Dense(512, activation='relu', input_shape=(28 * 28,), kernel_regularizer=l2(0.001)),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)


In this example:
- We use the MNIST dataset, a standard dataset for digit recognition.
- The model has one hidden layer with 512 units and L2 regularization applied 
to its weights with a regularization factor of 0.001.
- The `kernel_regularizer=l2(0.001)` part adds the L2 penalty to the loss function, 
helping to prevent overfitting.
- The model is trained for 10 epochs with a batch size of 128 and 20% of the 
training data is used for validation (`validation_split=0.2`).

Regularization helps the model to not only fit the training data but also 
generalize well to new, unseen data.
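
For comparison with the L2 example above, here is a sketch of the same model regularized with Dropout instead; the dropout rate of 0.5 is a common but arbitrary starting point:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation='relu', input_shape=(28 * 28,)),
    Dropout(0.5),                      # randomly zero out 50% of the activations during training
    Dense(10, activation='softmax')
])

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
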
In machine learning, parameters and hyperparameters are crucial concepts, but they 
refer to different aspects of the model.

Parameters
Parameters are internal variables of the model that are learned from the training 
data. They are the values that the learning algorithm adjusts during training to 
minimize the loss function. For example, in a neural network, the weights and biases of the neurons are parameters. Parameters are updated through optimization techniques like gradient descent.

Example:
- In a linear regression model [latex] y = wx + b [/latex], [latex] w [/latex] (weight) and [latex] b [/latex] (bias) are parameters.
- In a neural network, the weights and biases of the layers are parameters.

Hyperparameters
Hyperparameters are external to the model and set before the learning process begins. 
They control the training process and the model architecture. Hyperparameters are 
not learned from the training data but are often set through experimentation and 
tuning to find the best performance.

Example:
- Learning rate: Determines the step size during gradient descent.
- Number of epochs: The number of times the learning algorithm will work through 
the entire training dataset.
- Batch size: The number of training examples utilized in one iteration.
- Number of layers and units in each layer in a neural network.
- Regularization parameters like [latex] \lambda\ [/latex] in L2 regularization.

Key Differences
- Learning Process: Parameters are learned from the data, while hyperparameters 
are set before training.
- Role: Parameters define the model's final configuration, whereas hyperparameters 
guide the learning process and model structure.
- Adjustment: Parameters are adjusted by the training algorithm, while hyperparameters 
are typically adjusted through techniques like grid search or random search.

Example in Context of Neural Networks

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.datasets import mnist
from keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define the model
model = Sequential([
    Dense(512, activation='relu', input_shape=(28 * 28,)),
    Dense(10, activation='softmax')
])

# Define hyperparameters
learning_rate = 0.001
batch_size = 128
epochs = 10

# Compile the model with the Adam optimizer (setting the learning rate)
model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)



In this example:
- Parameters: The weights and biases of the layers in the model, which are learned 
during the training process.
- Hyperparameters: 
- Learning rate (`learning_rate = 0.001`)
- Number of epochs (`epochs = 10`)
- Batch size (`batch_size = 128`)

Hyperparameters are set before training starts, and they influence how the parameters 
are adjusted during the training process.

PyTorch vs TensorFlow

As of now, both PyTorch and TensorFlow are highly popular and widely used frameworks 
for deep learning, each with its own strengths and community support. However, some 
trends can be observed:

1. PyTorch:
- Adoption in Research: PyTorch is particularly favored in the research community due to 
its dynamic computation graph, which makes it easier to debug and develop new models.
- Ease of Use: It is often considered more Pythonic and intuitive, which makes it easier 
for beginners and researchers to work with.
- Growing Industry Adoption: While it started with a strong presence in academia, 
PyTorch is increasingly being adopted in industry settings as well.

2. TensorFlow: 
- Industry Standard: TensorFlow has been widely adopted in the industry, 
especially for production and deployment. It provides robust tools for scalability 
and deployment, including TensorFlow Serving and TensorFlow Lite for mobile and 
embedded devices. 
- Comprehensive Ecosystem: TensorFlow has a comprehensive ecosystem that includes
 TensorFlow Extended (TFX) for end-to-end machine learning pipelines, TensorFlow.js 
for running models in the browser, and TensorFlow Hub for sharing pre-trained models. 
- TensorFlow 2.0: The release of TensorFlow 2.0 made the framework more user-friendly 
by integrating Keras as its high-level API, which has narrowed the gap in ease of use 
compared to PyTorch. 

Overall, the choice between PyTorch and TensorFlow often comes down to specific use 
cases and personal or team preferences. Researchers may prefer PyTorch for its 
flexibility, while companies looking to deploy models at scale might lean towards 
TensorFlow for its production-ready capabilities. Both frameworks continue to 
evolve rapidly, with active development and new features being added regularly.

deep learning math

Let’s break down the key mathematical concepts in deep learning with equations and explanations:

1. Linear Algebra:
– Scalars, Vectors, Matrices, and Tensors:
– Scalars: Single numerical value (e.g., [latex]a[/latex]).
– Vectors: Ordered array of scalars (e.g., [latex]\mathbf{v} = [v_1,v_2,..v_n][/latex]).
– Matrices: 2D array of scalars (e.g., [latex]A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}[/latex]).
– Tensors: Generalization of matrices to higher dimensions.
– Matrix Multiplication:
– [latex]C = AB[/latex] where [latex]C_{ij} = \sum_k A_{ik}B_{kj}[/latex].
– Transpose of a Matrix:
– [latex]A^T[/latex] where [latex]A^T_{ij} = A_{ji}[/latex].
– Dot Product:
– [latex]c = \mathbf{v}_1 \cdot \mathbf{v}_2 = \sum_{i} v_{1i}v_{2i}[/latex].
– Hadamard Product:
– [latex]C = A \odot B[/latex] where [latex]C_{ij} = A_{ij} \times B_{ij}[/latex].

2. Calculus:
– Derivatives:
– The derivative of a function [latex]f(x)[/latex] with respect to [latex]x[/latex], denoted as [latex]f'(x)[/latex], represents the rate of change of [latex]f[/latex] at point [latex]x[/latex].
– Chain Rule:
– If [latex]f(x) = g(h(x))[/latex], then [latex]f'(x) = g'(h(x)) \cdot h'(x)[/latex].
– Gradient Descent:
– Update rule: [latex]x_{t+1} = x_t - \alpha \nabla f(x_t)[/latex], where [latex]f(x_t)[/latex] is the objective function, [latex]\nabla f(x_t)[/latex] is its gradient, and [latex]\alpha[/latex] is the learning rate.
– Partial Derivatives and Gradients:
– For [latex] f(x_1,x_2,\ldots,x_n) [/latex], the gradient is [latex]\nabla f = \left[\frac{\partial f}{\partial x_1},\frac{\partial f}{\partial x_2},\ldots,\frac{\partial f}{\partial x_n} \right][/latex].

3. Probability and Statistics:
– Probability Distributions:
– Gaussian (Normal), Bernoulli, etc.
– Expectation and Variance:
– Expectation: [latex]E[X] = \sum_x x P(X=x)[/latex].
– Variance: [latex]Var(X) = E[(X - \mu)^2][/latex].
– Maximum Likelihood Estimation (MLE):
– Estimating parameters that maximize the likelihood of observing the data.
– Bayesian Inference and Bayes’ Theorem:
– [latex]P(A|B) = \frac{P(B|A)P(A)}{P(B)}[/latex].

4. Optimization:
– Gradient Descent:
– Update rule: [latex]x_{t+1} = x_t - \alpha \nabla f(x_t)[/latex].
– Stochastic Gradient Descent (SGD):
– Mini-batch update: [latex]x_{t+1} = x_t - \alpha \nabla f(x_t; \mathcal{D}_t)[/latex], where [latex]\mathcal{D}_t[/latex] is a random mini-batch.
– Adam, RMSProp, etc.:
– Advanced optimization algorithms with adaptive learning rates.

5. Neural Networks:
– Activation Functions:
– Sigmoid: [latex]f(x) = \frac{1}{1 + e^{-x}}[/latex].
– ReLU: [latex]f(x) = \max(0, x)[/latex].
– Feedforward Propagation:
– [latex]z = Wx + b[/latex], [latex]a = \text{activation}(z)[/latex].
– Backpropagation:
– Compute gradients of the loss with respect to weights using the chain rule.
– CNNs, RNNs, LSTMs, GRUs:
– Architectures for handling specific types of data and learning tasks.

6. Loss Functions:
– Mean Squared Error (MSE):
– [latex]L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2[/latex].
– Cross-Entropy Loss:
– [latex]L(y, \hat{y}) = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)[/latex].

7. Regularization:
– L1 and L2 Regularization:
– Add penalty terms to the loss function: [latex]L_{\text{regularized}} = L_{\text{original}} + \lambda \| \theta \|_p[/latex].
– Dropout:
– Randomly drop units during training to prevent overfitting.
– Batch Normalization:
– Normalize inputs of each layer to speed up training and reduce overfitting.

Understanding these equations and concepts provides a solid foundation for diving deeper into the mathematics behind deep learning.
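
A short NumPy sketch tying a few of these pieces together: matrix products, a single gradient descent step on [latex] f(x) = x^2 [/latex] (whose derivative is [latex] 2x [/latex]), and the MSE and cross-entropy losses. All of the numbers below are illustrative:

import numpy as np

# Linear algebra: matrix product, transpose, dot product, Hadamard product
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
print(A @ B)                        # matrix multiplication
print(A.T)                          # transpose
print(A * B)                        # Hadamard (element-wise) product
print(np.dot([1., 2.], [3., 4.]))   # dot product = 11.0

# Calculus / optimization: one gradient descent step on f(x) = x^2
x, lr = 3.0, 0.1
grad = 2 * x            # derivative of x^2
x = x - lr * grad       # x moves from 3.0 to 2.4, toward the minimum at 0

# Loss functions: MSE and cross-entropy for one sample
y_true = np.array([1., 0., 0.])
y_pred = np.array([0.7, 0.2, 0.1])
mse = np.mean((y_true - y_pred) ** 2)
cross_entropy = -np.sum(y_true * np.log(y_pred))
print(mse, cross_entropy)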

tensorflow & keras

tensor

In TensorFlow, a tensor is a multi-dimensional array used to represent data. Tensors are the primary data structure in TensorFlow, and they are used for both input data and the parameters of machine learning models. They can have various dimensions, known as ranks, such as:

  • Rank 0 Tensor: A scalar (e.g., 5).
  • Rank 1 Tensor: A vector (e.g., [1, 2, 3]).
  • Rank 2 Tensor: A matrix (e.g., [[1, 2], [3, 4]]).
  • Higher Rank Tensors: Arrays with three or more dimensions.

In TensorFlow, tensors have several key attributes that define their properties and how they can be used in computations. Here are the key attributes of a tensor:

1. Rank
The rank of a tensor refers to the number of dimensions it has. For example:
– A scalar has a rank of 0.
– A vector has a rank of 1.
– A matrix has a rank of 2.
– Higher-dimensional arrays have ranks 3 and above.

2. Shape
The shape of a tensor is a tuple that describes the size of each dimension. For example:
– A scalar has an empty shape `()`.
– A vector with 5 elements has a shape `(5,)`.
– A matrix with 3 rows and 4 columns has a shape `(3, 4)`.
– A 3-dimensional tensor with dimensions 2, 3, and 4 has a shape `(2, 3, 4)`.

3. Data Type (dtype)
The data type of a tensor specifies the type of values it holds, such as:
– `tf.float32`: 32-bit floating point.
– `tf.int32`: 32-bit integer.
– `tf.bool`: Boolean values.

4. Device
The device attribute specifies the hardware device where the tensor is stored and on which computations are performed, such as:
– CPU: `/device:CPU:0`
– GPU: `/device:GPU:0`

Example in TensorFlow (I ran this in deepnote.com):

import tensorflow as tf
tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)

print(tensor)
print("\nRank:", tf.rank(tensor).numpy())
print("Shape:", tensor.shape)
print("Data Type:", tensor.dtype)
print("Device:", tensor.device)

Output:

tf.Tensor( [[1. 2.] [3. 4.]], shape=(2, 2), dtype=float32) 
Rank: 2 
Shape: (2, 2) 
Data Type: <dtype: 'float32'>
Device: /job:localhost/replica:0/task:0/device:CPU:0
import tensorflow as tf

# Creating Tensors
tensor_a = tf.constant([[1, 2, 3], [4, 5, 6]])
tensor_b = tf.constant([[7, 8, 9], [10, 11, 12]])

# Basic Tensor Operations
# Addition
tensor_add = tf.add(tensor_a, tensor_b)
print("Addition:\n", tensor_add.numpy())

# Subtraction
tensor_sub = tf.subtract(tensor_a, tensor_b)
print("Subtraction:\n", tensor_sub.numpy())

# Element-wise Multiplication
tensor_mul = tf.multiply(tensor_a, tensor_b)
print("Element-wise Multiplication:\n", tensor_mul.numpy())

# Matrix Multiplication
tensor_matmul = tf.matmul(tensor_a, tensor_b, transpose_b=True)
print("Matrix Multiplication:\n", tensor_matmul.numpy())

# Division
tensor_div = tf.divide(tensor_a, tensor_b)
print("Division:\n", tensor_div.numpy())

# Creating Tensors with different data types
tensor_float = tf.constant([[1.1, 2.2], [3.3, 4.4]], dtype=tf.float32)
tensor_int = tf.constant([[1, 2], [3, 4]], dtype=tf.int32)

# Reshaping Tensors
tensor_reshaped = tf.reshape(tensor_a, [3, 2])
print("Reshaped Tensor:\n", tensor_reshaped.numpy())

# Transposing Tensors
tensor_transposed = tf.transpose(tensor_a)
print("Transposed Tensor:\n", tensor_transposed.numpy())

# Reducing dimensions
tensor_sum = tf.reduce_sum(tensor_a)
print("Sum of all elements:\n", tensor_sum.numpy())

tensor_max = tf.reduce_max(tensor_a)
print("Maximum element:\n", tensor_max.numpy())

# Broadcasting
tensor_c = tf.constant([1, 2, 3])
tensor_broadcasted_add = tf.add(tensor_a, tensor_c)
print("Broadcasted Addition:\n", tensor_broadcasted_add.numpy())

# Applying functions element-wise
tensor_squared = tf.square(tensor_a)
print("Element-wise Squaring:\n", tensor_squared.numpy())

# Creating a Tensor from NumPy array
import numpy as np
np_array = np.array([[1, 2, 3], [4, 5, 6]])
tensor_from_np = tf.convert_to_tensor(np_array, dtype=tf.int32)
print("Tensor from NumPy array:\n", tensor_from_np.numpy())

output:
Addition:
[[ 8 10 12]
[14 16 18]]
Subtraction:
[[-6 -6 -6]
[-6 -6 -6]]
Element-wise Multiplication:
[[ 7 16 27]
[40 55 72]]
Matrix Multiplication:
[[ 50 68]
[122 167]]
Division:
[[0.14285714 0.25 0.33333333]
[0.4 0.45454545 0.5 ]]
Reshaped Tensor:
[[1 2]
[3 4]
[5 6]]
Transposed Tensor:
[[1 4]
[2 5]
[3 6]]
Sum of all elements:
21
Maximum element:
6
Broadcasted Addition:
[[2 4 6]
[5 7 9]]
Element-wise Squaring:
[[ 1 4 9]
[16 25 36]]
Tensor from NumPy array:
[[1 2 3]
[4 5 6]]
string = tf.Variable("this is a string", dtype=tf.string)
number = tf.Variable(324, dtype=tf.int16)
floating = tf.Variable(3.567, dtype=tf.float64)

from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import LSTM 
from tensorflow.keras.layers import Dense 
from tensorflow.keras.layers import Flatten 
from tensorflow.keras.layers import TimeDistributed 
from tensorflow.keras.layers import Conv1D 
from tensorflow.keras.layers import MaxPooling1D

keras examples

From the TensorFlow website:

https://www.tensorflow.org/tutorials




# The example from the famous book by Chollet.

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist

# Load the MNIST data (the book loads this earlier in the chapter)
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

model = keras.Sequential([
    layers.Dense(512, activation="relu"),     # fully connected layer
    layers.Dense(10, activation="softmax")    # 10-way softmax classification layer, returning an array of 10 probability scores
])

model.compile(optimizer="rmsprop",                     # how the model will update itself based on the training data it sees
              loss="sparse_categorical_crossentropy",  # how the model measures its performance on the training data
              metrics=["accuracy"])                    # the fraction of the images that were correctly classified

# Preprocessing: reshape the data into the shape the model expects and scale it so that all values are in the [0, 1] interval
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

model.fit(train_images, train_labels, epochs=5, batch_size=128)

>>> test_digits = test_images[0:10]
>>> predictions = model.predict(test_digits)
>>> predictions[0]
array([1.0726176e-10, 1.6918376e-10, 6.1314843e-08, 8.4106023e-06,
2.9967067e-11, 3.0331331e-09, 8.3651971e-14, 9.9999106e-01,
2.6657624e-08, 3.8127661e-07], dtype=float32)

# This first test digit has the highest probability score (0.99999106, almost 1) at index 7, so according to our model, it must be a 7:
>>> predictions[0].argmax() 
7 
>>> predictions[0][7] 
0.99999106
>>> test_labels[0] 
7 

>>> test_loss, test_acc = model.evaluate(test_images, test_labels)
>>> print(f"test_acc: {test_acc}")
test_acc: 0.9785

Source: Chollet 2022, Deep Learning with Python, Second Edition, CH 2

# A simple example of Keras by Chat GPT

  1. Importing modules:
    • from keras.models import Sequential: Imports the Sequential model API from Keras, which is useful for creating a linear stack of layers (simple models).
    • from keras.layers import Dense: Imports the Dense layer, which is a fully connected neural network layer.
  2. Model initialization:
    • model = Sequential(): Creates an instance of a Sequential model. This will be the container into which we will add our layers.
  3. Adding layers:
    • model.add(Dense(12, activation='relu', input_shape=(n_features,))): Adds a fully connected layer (Dense) with 12 neurons. The activation='relu' argument specifies the Rectified Linear Unit activation function. The input_shape should match the number of features in your dataset (excluding the target variable).
  4. Adding a hidden layer:
    • model.add(Dense(8, activation='relu')): Adds another dense layer with 8 neurons, also with ReLU activation. This is the hidden layer.
  5. Adding the output layer:
    • model.add(Dense(1, activation='sigmoid')): Adds the output layer with a single neuron because it’s a binary classification. activation='sigmoid' is used because it outputs a probability between 0 and 1, which is ideal for binary classification.
  6. Compiling the model:
    • model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']): Compiles the model for training. loss='binary_crossentropy' is the loss function commonly used for binary classification. optimizer='adam' is an efficient gradient descent algorithm, and we choose to monitor accuracy during training.
  7. Fitting the model:
    • model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=1): Fits the model on the training data. epochs=50 means the entire dataset is passed forward and backward through the neural network 50 times. batch_size=1 indicates we are using stochastic gradient descent (one sample per gradient update). verbose=1 shows the training progress.
  8. Evaluating the model:
    • model.evaluate(X_test, y_test, verbose=0): Evaluates the model on the testing set quietly (verbose=0). Returns loss and accuracy.
  9. Print accuracy:
    • print('Accuracy: %.2f' % accuracy): Prints the accuracy of the model after evaluation.
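
The steps above describe code that is not shown; here is a minimal sketch of what it could look like, assuming `X_train`, `y_train`, `X_test`, `y_test`, and `n_features` are already defined for a binary classification dataset:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(12, activation='relu', input_shape=(n_features,)))  # input + first layer
model.add(Dense(8, activation='relu'))                              # hidden layer
model.add(Dense(1, activation='sigmoid'))                           # binary output

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=1)

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy: %.2f' % accuracy)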

# Another example

This code is very basic and meant for educational purposes. In a real-world scenario, you’d need to preprocess your data, possibly scale it, and handle other aspects like model validation and hyperparameter tuning.

  1. Import Libraries:
    • Import necessary modules for data manipulation, model creation, and evaluation.
  2. Load Dataset:
    • The Iris dataset is loaded from scikit-learn. It contains 4 features and 3 classes of iris species.
  3. Preprocess Data:
    • StandardScaler: Scales the features to have zero mean and unit variance, which helps in faster convergence of neural networks.
    • OneHotEncoder: Since this is a multi-class classification problem (3 classes), the target variable y is one-hot encoded. This converts it into a format suitable for softmax classification.
    • train_test_split: Splits the dataset into training and testing sets with 70% training and 30% testing.
  4. Build Neural Network Model:
    • A Sequential model is created, followed by adding two hidden layers with 10 neurons each and ReLU activation. The output layer has 3 neurons (one for each class) with softmax activation, which is used for multi-class classification.
  5. Compile the Model:
    • The model is compiled with the Adam optimizer, categorical crossentropy loss function (suitable for multi-class classification problems), and accuracy as a metric to track during training.
  6. Train the Model:
    • The model is trained for 100 epochs with a batch size of 5. This means the model will see the training data 100 times and update the weights in batches of 5 samples each.
  7. Evaluate the Model:
    • Finally, the model’s performance is evaluated on the test set. Loss and accuracy are printed out. Lower loss and higher accuracy indicate better model performance.

When you run this code, it will display the accuracy and loss after training and testing, giving you insights into how well the model has learned to classify the iris species. You can adjust the number of epochs, batch size, and model architecture to see how these changes affect model performance.
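
The walkthrough above refers to code that is not shown; a sketch of the pipeline it describes might look like this (the random seeds are arbitrary, and `sparse_output=False` assumes scikit-learn 1.2 or newer; older versions use `sparse=False`):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

# Load the Iris dataset: 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Scale features and one-hot encode the target
X = StandardScaler().fit_transform(X)
y = OneHotEncoder(sparse_output=False).fit_transform(y.reshape(-1, 1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Two hidden layers of 10 units each, softmax output over the 3 classes
model = Sequential([
    Dense(10, activation='relu', input_shape=(4,)),
    Dense(10, activation='relu'),
    Dense(3, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100, batch_size=5, verbose=0)
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test loss: %.3f, test accuracy: %.3f' % (loss, accuracy))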

### Example 
#feature normalization (or scaling)
normalized_feature = keras.utils.normalize(X.values)

# Import train_test_split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Split up the data into a training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

# Build the Network
from tensorflow import keras
from keras.models import Sequential
#from tensorflow.keras.models import Sequential
from keras.layers import Dense

## Build Model (Building a three layer network - with one hidden layer)
model = Sequential()
model.add(Dense(4, input_dim=4, activation ='relu'))  
# You don't have to specify input size. Just define the hidden layers
model.add(Dense(3, activation='relu'))
model.add(Dense(1))

# Compile Model
model.compile(optimizer='adam', loss='mse', metrics=['mse'])

#  Fit the Model
history = model.fit(X_train, y_train, validation_data = (X_test, y_test),
                    epochs = 32)

#inspect the model
model.summary()

model.evaluate(X_test, y_test)[1]

# predict SALES using the test data
test_predictions = model.predict(X_test).flatten()
### Example 

# Imports assumed for this snippet
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Create a scaler model that is fit on the input data.
scaler = StandardScaler().fit(X_data)

#Scale the numeric feature variables
X_data = scaler.transform(X_data)

#Convert target variable as a one-hot-encoding array
Y_data = tf.keras.utils.to_categorical(Y_data,3)

#Split training and test data
X_train,X_test,Y_train,Y_test = train_test_split( X_data, Y_data, test_size=0.10)
from tensorflow import keras

#Number of classes in the target variable
NB_CLASSES=3

#Create a sequencial model in Keras
model = tf.keras.models.Sequential()

#Add the first hidden layer
model.add(keras.layers.Dense(128,                   #Number of nodes
                             input_shape=(4,),      #Number of input variables
                             name='Hidden-Layer-1', #Logical name
                             activation='relu'))    #activation function

#Add a second hidden layer
model.add(keras.layers.Dense(128,
                             name='Hidden-Layer-2',
                             activation='relu'))

#Add an output layer with softmax activation
model.add(keras.layers.Dense(NB_CLASSES,
                             name='Output-Layer',
                             activation='softmax'))

#Compile the model with loss & metrics
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])

#Print the model meta-data
model.summary()

#Make it verbose so we can see the progress
VERBOSE=1

#Setup Hyper Parameters for training

#Set Batch size
BATCH_SIZE=16
#Set number of epochs
EPOCHS=10
#Set validation split. 20% of the training data will be used for validation
#after each epoch
VALIDATION_SPLIT=0.2

print("\nTraining Progress:\n------------------------------------")

#Fit the model. This will perform the entire training cycle, including
#forward propagation, loss computation, backward propagation and gradient descent.
#Execute for the specified batch sizes and epoch
#Perform validation after each epoch
history = model.fit(X_train,
                    Y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    verbose=VERBOSE,
                    validation_split=VALIDATION_SPLIT)

print("\nAccuracy during Training :\n------------------------------------")
import matplotlib.pyplot as plt
import pandas as pd   # used below to plot the training history

#Plot accuracy of the model after each epoch.
pd.DataFrame(history.history)["accuracy"].plot(figsize=(8, 5))
plt.title("Accuracy improvements with Epoch")
plt.show()

#Evaluate the model against the test dataset and print results
print("\nEvaluation against Test Dataset :\n------------------------------------")
model.evaluate(X_test,Y_test)

### Example   https://elitedatascience.com/keras-tutorial-deep-learning-in-python
# 3. Import libraries and modules
import numpy as np
np.random.seed(123)  # for reproducibility

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist

# 4. Load pre-shuffled MNIST data into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# 5. Preprocess input data
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# 6. Preprocess class labels
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

# 7. Define model architecture
model = Sequential()
model.add(Convolution2D(32, (3,3), activation='relu', input_shape=(28,28,1)))
model.add(Convolution2D(32, (3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# 8. Compile model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# 9. Fit model on training data
model.fit(X_train, Y_train,
          batch_size=32, epochs=10, verbose=1)

# 10. Evaluate model on test data
score = model.evaluate(X_test, Y_test, verbose=0)

batches

In deep learning, data batching is a technique used to improve the efficiency and performance of training models. Instead of processing the entire dataset at once, the dataset is divided into smaller chunks called batches. Each batch is processed independently through the neural network during training. This approach has several benefits, including reducing memory usage and enabling faster and more stable convergence of the model.

Concept of Data Batches

1. Batch Size:
– The batch size is the number of samples processed before the model’s internal parameters (weights and biases) are updated.
– Common batch sizes are powers of 2, such as 32, 64, 128, etc.

2. Epoch:
– An epoch is one complete pass through the entire training dataset.
– If you have a dataset with 1,000 samples and a batch size of 100, it will take 10 batches to complete one epoch.

3. Iterations:
– An iteration refers to one update of the model’s parameters. It corresponds to processing one batch of data.
– If you have a dataset with 1,000 samples, a batch size of 100, and you train for 10 epochs, you will have 100 iterations (10 epochs * 10 batches per epoch).

Benefits of Using Data Batches

1. Memory Efficiency:
– Processing the entire dataset at once (batch size = dataset size) might not fit into memory, especially with large datasets. Batching allows training on smaller chunks that fit in memory.

2. Computational Efficiency:
– Modern hardware, such as GPUs, are optimized for batch processing. Using batches can lead to more efficient use of computational resources.

3. Stable Training:
– Batching helps to smooth out the gradient updates, which can lead to more stable training and better convergence.

Example of Data Batching in Practice

Let’s consider a simple example using TensorFlow and the MNIST dataset, which consists of 60,000 training images of handwritten digits.

import tensorflow as tf

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the data
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create a simple model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model with a batch size of 32
model.fit(x_train, y_train, epochs=5, batch_size=32)

# Evaluate the model
model.evaluate(x_test, y_test, verbose=2)

Explanation

1. Loading and Normalizing Data:
– The MNIST dataset is loaded and normalized to have values between 0 and 1.

2. Model Creation:
– A simple neural network model is created with one hidden layer and an output layer.

3. Compiling the Model:
– The model is compiled with an optimizer, loss function, and metrics.

4. Training with Batches:
– The model is trained using `model.fit()` with a specified batch size of 32.
– This means during each epoch, the model will process 32 samples at a time, update the parameters, and then proceed to the next batch until the entire dataset is covered.

5. Evaluation:
– After training, the model is evaluated on the test set.

Using batches allows for more efficient training and can lead to better performance and faster convergence compared to processing the entire dataset at once.

broadcasting

Broadcasting is a powerful mechanism that allows numpy and TensorFlow to perform arithmetic operations on arrays (tensors) of different shapes. It does this by virtually expanding the smaller array along the mismatched dimensions so that they have compatible shapes. Broadcasting makes many arithmetic operations easier to write and understand.

Broadcasting Rules

1. Dimensions Compatibility: Two dimensions are compatible when:
– They are equal, or
– One of them is 1.

2. Align from the Right: When comparing two arrays, start with the trailing (right-most) dimensions and work your way left. If the dimensions are not compatible, broadcasting cannot be performed.

How Broadcasting Works

When performing operations on two tensors, broadcasting happens as follows:
– Expand dimensions: If one tensor has fewer dimensions than the other, leading dimensions (on the left) of size 1 are added to the smaller tensor.
– Stretch dimensions: Dimensions of size 1 in the smaller tensor are virtually expanded to match the size of the corresponding dimension in the larger tensor.

Example 1: Adding a Scalar to a Tensor

import tensorflow as tf

# Tensor
tensor_a = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)
# Scalar
scalar_b = tf.constant(2, dtype=tf.float32)

# Broadcasting addition
result = tensor_a + scalar_b
print("Broadcasting Addition:\n", result.numpy())

Output:

Broadcasting Addition:
[[3. 4. 5.]
[6. 7. 8.]]

Explanation: The scalar `2` is broadcasted to match the shape of `tensor_a`, effectively creating an array `[[2, 2, 2], [2, 2, 2]]` before performing element-wise addition.

Example 2: Adding a Vector to a Matrix

# Matrix
tensor_a = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)
# Vector
vector_b = tf.constant([1, 2, 3], dtype=tf.float32)

# Broadcasting addition
result = tensor_a + vector_b
print("Broadcasting Addition:\n", result.numpy())


Output:

Broadcasting Addition:
[[2. 4. 6.]
[5. 7. 9.]]

Explanation: The vector `[1, 2, 3]` is broadcasted to match the shape of `tensor_a`, effectively creating an array `[[1, 2, 3], [1, 2, 3]]` before performing element-wise addition.

Example 3: Broadcasting a Vector Across the Rows of a Matrix

# 2D Tensor (Matrix)
tensor_a = tf.constant([[1, 2], [3, 4], [5, 6]], dtype=tf.float32)
# 1D Tensor (Vector)
tensor_b = tf.constant([1, 2], dtype=tf.float32)

# Broadcasting addition
result = tensor_a + tensor_b
print("Broadcasting Addition:\n", result.numpy())

Output:

Broadcasting Addition:
[[2. 4.]
[4. 6.]
[6. 8.]]

Explanation: The vector [1, 2] is broadcasted across the rows of the matrix  tensor_a.

Summary

Broadcasting simplifies arithmetic operations on tensors of different shapes by automatically expanding smaller tensors along the mismatched dimensions. The main rules are that dimensions must either match or be of size 1, and TensorFlow (or numpy) will handle the rest. This allows for more concise and readable code when performing element-wise operations on tensors.

gradient-based optimization

Gradient-based optimization is a cornerstone technique in machine learning, particularly in training neural networks. It involves using the gradient of a loss function with respect to the model’s parameters to minimize the loss and improve the model’s performance. The most common gradient-based optimization method is gradient descent.

Key Concepts

1. Gradient: The gradient is a vector of partial derivatives that indicates the direction and rate of the fastest increase of a function. For a loss function [latex] L(\theta) [/latex], where [latex] \theta [/latex] represents the model parameters, the gradient [latex] \nabla L(\theta) [/latex] points in the direction of the steepest ascent. To minimize the loss, we move in the opposite direction of the gradient.

2. Loss Function: This is a measure of how well the model’s predictions match the actual data. Common loss functions include mean squared error for regression tasks and cross-entropy for classification tasks.

3. Learning Rate: A hyperparameter that controls the size of the steps taken to reach a minimum. A learning rate that is too high may overshoot the minimum, while a learning rate that is too low may take too long to converge.

Gradient Descent Variants

1. Batch Gradient Descent: Computes the gradient of the loss function with respect to the entire dataset. While this approach is accurate, it can be very slow and computationally expensive for large datasets.

2. Stochastic Gradient Descent (SGD): Computes the gradient of the loss function with respect to a single training example. This approach is faster and can handle large datasets, but the updates can be noisy and cause the loss to fluctuate.

3. Mini-Batch Gradient Descent: Computes the gradient of the loss function with respect to a small batch of training examples. This approach balances the trade-offs between batch and stochastic gradient descent, offering faster convergence and more stable updates.

Optimization Algorithms

1. Standard Gradient Descent:

[latex]
\theta := \theta - \eta \nabla L(\theta)
[/latex]

where [latex] \eta [/latex] is the learning rate.

2. Momentum:

[latex]
v := \beta v + \eta \nabla L(\theta)
[/latex]

[latex]
\theta := \theta - v
[/latex]

Momentum helps accelerate gradient descent by considering the previous gradients to smooth out the updates.

3. RMSprop:

[latex]
E[g^2]_t := \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2
[/latex]

[latex]
\theta := \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
[/latex]

RMSprop adjusts the learning rate for each parameter, scaling it by the inverse square root of the running average of recent gradients.

4. Adam (Adaptive Moment Estimation):

[latex]
m_t := \beta_1 m_{t-1} + (1 - \beta_1) g_t
[/latex]

[latex]
v_t := \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
[/latex]

[latex]
\hat{m}_t := \frac{m_t}{1 - \beta_1^t}
[/latex]

[latex]
\hat{v}_t := \frac{v_t}{1 - \beta_2^t}
[/latex]

[latex]
\theta := \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
[/latex]

Adam combines the ideas of momentum and RMSprop, maintaining an exponentially decaying average of past gradients and squared gradients.
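
In practice these update rules are rarely hand-coded; a quick sketch of selecting them in Keras (the hyperparameter values below are common defaults, not recommendations):

from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=0.01)                          # plain gradient descent
sgd_momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)   # SGD with momentum
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)        # RMSprop
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)  # Adam

# Any of these can then be passed to model.compile(optimizer=..., loss=..., metrics=[...])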

Example in TensorFlow

Here is an example of using gradient descent with TensorFlow to optimize a simple linear regression model.

import tensorflow as tf
import numpy as np

# Generate some synthetic data (cast to float32 to match the float32 model variables)
np.random.seed(0)
X = np.random.rand(100, 1).astype(np.float32)
y = (2 * X + 1 + 0.1 * np.random.randn(100, 1)).astype(np.float32)

# Define the model
class LinearModel(tf.Module):
    def __init__(self):
        self.W = tf.Variable(np.random.randn(), dtype=tf.float32)
        self.b = tf.Variable(np.random.randn(), dtype=tf.float32)

    def __call__(self, x):
        return self.W * x + self.b

# Define the loss function
def loss_fn(model, x, y):
    y_pred = model(x)
    return tf.reduce_mean(tf.square(y - y_pred))

# Training function
def train(model, x, y, learning_rate):
    with tf.GradientTape() as tape:
        loss = loss_fn(model, x, y)
    gradients = tape.gradient(loss, [model.W, model.b])
    model.W.assign_sub(learning_rate * gradients[0])
    model.b.assign_sub(learning_rate * gradients[1])

# Initialize the model
model = LinearModel()

# Training loop
learning_rate = 0.1
epochs = 100

for epoch in range(epochs):
    train(model, X, y, learning_rate)
    if epoch % 10 == 0:
        current_loss = loss_fn(model, X, y)
        print(f"Epoch {epoch}: Loss: {current_loss.numpy()}")

print(f"Trained Weights: W = {model.W.numpy()}, b = {model.b.numpy()}")

Explanation

1. Synthetic Data: Generating some synthetic data for a simple linear regression problem.

2. Model Definition: A simple linear model with one weight and one bias.

3. Loss Function: Mean squared error between the predicted and actual values.

4. Training Function: Uses TensorFlow’s `GradientTape` to compute the gradients and update the model parameters using gradient descent.

5. Training Loop: Iteratively updates the model parameters over a number of epochs.

Gradient-based optimization techniques like these are essential for training neural networks and other machine learning models, enabling them to learn from data and improve their performance over time.

activation function

ReLU: https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning

overfitting

Overfitting is particularly likely to occur when your data is noisy, if it involves
uncertainty, or if it includes rare features. (Chollet, 2021)

Overfitting occurs when a statistical model or machine learning algorithm captures noise or random fluctuations in the training data rather than the underlying data distribution. This typically happens when the model is too complex, having too many parameters relative to the number of observations or features in the training data. As a result, the model performs very well on the training data but poorly on unseen or test data.

Here are some common causes of overfitting:

1. Excessive Model Complexity: Using models that are too complex for the amount of data available, such as deep neural networks with many layers for a relatively small dataset.

2. Insufficient Training Data: Not having enough training data can lead to the model capturing noise specific to the training set rather than generalizable patterns.

3. Noise in Data: If the training data contains a lot of noise or irrelevant information, the model might learn these noise patterns instead of the true underlying trends.

4. High Variance in the Model: Models with high variance are very flexible and can fit the training data very closely, which leads to overfitting.

Signs of Overfitting
– High accuracy on training data but significantly lower accuracy on validation/test data.
– The model captures noise and outliers in the training data.
– High variance in model performance when using different subsets of the training data.

Techniques to Prevent Overfitting
– Simplifying the Model: Reducing the complexity of the model by using fewer parameters or features.
– Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add a penalty to the model complexity.
– Cross-Validation: Using k-fold cross-validation to ensure the model performs well on different subsets of the data.
– Pruning: In decision trees, pruning helps in removing parts of the tree that do not provide power to classify instances.
– Early Stopping: In iterative learning algorithms like gradient descent, training can be stopped early when performance on a validation set starts to degrade.
– Ensembling: Combining the predictions of multiple models (bagging, boosting, stacking) to reduce overfitting.

By applying these techniques, you can develop models that generalize better to new, unseen data.
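
A brief sketch of early stopping in Keras: training halts once the validation loss stops improving for `patience` epochs. The values are illustrative, and `model`, `X_train`, and `y_train` are assumed to be already defined and compiled:

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',           # watch the validation loss
    patience=3,                   # stop after 3 epochs without improvement
    restore_best_weights=True)    # roll back to the best epoch's weights

history = model.fit(X_train, y_train,
                    epochs=100,
                    validation_split=0.2,
                    callbacks=[early_stop])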

Generalization in machine learning refers to the ability of a model to perform well on new, previously unseen data, drawn from the same distribution as the data used to train the model. It is a measure of how well the concepts learned by the model can be applied to real-world scenarios outside the training set.

Key Aspects of Generalization:

1. Training vs. Testing Performance:
– Training Performance: How well the model performs on the training data.
– Testing Performance: How well the model performs on a separate set of data not seen during training (testing/validation data). Good generalization is indicated by similar performance on both training and testing data.

2. Overfitting:
– Occurs when a model learns the training data too well, capturing noise and details that do not generalize to new data.
– An overfitted model has low training error but high testing error.

3. Underfitting:
– Occurs when a model is too simple to capture the underlying patterns in the training data.
– An underfitted model has high training error and, consequently, high testing error.

4. Bias-Variance Tradeoff:
– Bias: Error due to overly simplistic models that do not capture the complexity of the data.
– Variance: Error due to models that are too complex and sensitive to the noise in the training data.
– The goal is to find a balance between bias and variance to achieve good generalization.

5. Regularization:
– Techniques like L1 and L2 regularization, dropout, and early stopping are used to prevent overfitting and improve generalization.

6. Cross-Validation:
– A method to evaluate the generalization ability of a model by dividing the data into multiple subsets and training/testing the model on these subsets.

Techniques to Improve Generalization:

1. Data Augmentation:
– Increasing the diversity of training data by creating modified versions of existing data.

2. Ensemble Methods:
– Combining multiple models to reduce variance and improve robustness.

3. Hyperparameter Tuning:
– Optimizing the parameters that control the learning process to enhance performance.

4. Pruning:
– Reducing the complexity of models by removing parts that are not contributing significantly to predictions.

Generalization is crucial because it determines the practical usability of a machine learning model. A model that generalizes well is more likely to perform reliably when applied to new data in real-world situations.
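As a concrete illustration of cross-validation, here is a minimal scikit-learn sketch; the synthetic data and the Ridge (L2-regularized) model are illustrative assumptions, not tied to any particular example above.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

model = Ridge(alpha=1.0)                      # L2-regularized linear regression
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # one R^2 score per fold
print(f"Mean R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Similar scores across folds suggest the model generalizes consistently;
# large variation across folds is a warning sign.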

training loop

Training Loop Steps

1. Initialization:
– The model’s weights (parameters) are initialized randomly.
– A dataset is prepared with input data (features) and corresponding correct output (labels).

2. Forward Pass:
– The input data is fed into the model.
– The model processes the input data through its layers and generates predictions.

3. Loss Calculation:
– The predictions are compared to the actual labels to calculate the loss.
– The loss is a measure of how far the model’s predictions are from the actual values.

4. Backward Pass (Backpropagation):
– The loss is propagated back through the network to calculate the gradients of the loss with respect to each weight.
– This involves using the chain rule to compute derivatives.

5. Weight Update (Optimization):
– The model’s weights are updated using an optimization algorithm (e.g., Stochastic Gradient Descent, Adam) to minimize the loss.
– This step uses the gradients calculated during the backward pass.

6. Iteration:
– Steps 2 to 5 are repeated for a fixed number of iterations (epochs) or until the model’s performance stops improving.

Example Training Loop in Python (using PyTorch)

import torch
import torch.nn as nn
import torch.optim as optim

# Dummy dataset
inputs = torch.randn(100, 10)  # 100 samples, 10 features each
labels = torch.randn(100, 1)   # 100 samples, 1 output each

# Simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()

# Loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()  # Set model to training mode

    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward pass and optimization
    optimizer.zero_grad()  # Clear the gradients
    loss.backward()        # Calculate the gradients
    optimizer.step()       # Update the weights

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training complete!")

Explanation of the Code

1. Dataset Preparation:
– `inputs` and `labels` are randomly generated tensors representing the features and labels.

2. Model Definition:
– `SimpleModel` is a simple neural network with one linear layer (`fc`).

3. Loss Function and Optimizer:
– `criterion` is the Mean Squared Error loss function.
– `optimizer` is Stochastic Gradient Descent (SGD) with a learning rate of 0.01.

4. Training Loop:
– The loop runs for `num_epochs` (100 iterations).
– In each iteration:
– The model performs a forward pass to make predictions (`outputs`).
– The loss is computed using the predictions and actual labels.
– The gradients are cleared using `optimizer.zero_grad()`.
– Backpropagation is performed using `loss.backward()` to compute gradients.
– The optimizer updates the model’s weights using `optimizer.step()`.
– Every 10 epochs, the current loss is printed.

This is a basic example to illustrate the core steps in a training loop. In real applications, the dataset would be split into training and validation sets, and more complex models and optimization techniques would be used.

chain Rule

The chain rule is a fundamental concept in calculus used extensively in deep learning for backpropagation. It helps in calculating the gradient of a loss function with respect to each weight in a neural network. Let’s break it down with an example.

The chain rule is used to compute the derivative of a composite function. If you have a function [latex] z = f(g(x)) [/latex], the chain rule states that the derivative of z with respect to x is:

[latex]
\frac{dz}{dx} = \frac{dz}{dg} \cdot \frac{dg}{dx}
[/latex]

In other words, you first compute the derivative of the outer function f with respect to the inner function g, and then multiply it by the derivative of the inner function g with respect to x.

Chain Rule in Deep Learning

In a neural network, you typically have multiple layers, and you need to compute how changes in the weights of each layer affect the final loss. Here’s a simple neural network with one hidden layer:

1. Input Layer:  [latex]x[/latex]
2. Hidden Layer: [latex] h = f(W_1 \cdot x + b_1) [/latex]
3. Output Layer: [latex] y = g(W_2 \cdot h + b_2) [/latex]
4. Loss Function: [latex] L = \text{loss}(y, \text{true\_label}) [/latex]

To update the weights [latex]W_1[/latex] and [latex]W_2[/latex], you need to compute the gradient of the loss [latex] L [/latex] with respect to these weights.

Example with Detailed Steps

Let’s take a specific example with simple functions and numbers.

1. Forward Pass:
– Suppose [latex]x = 1[/latex]
– [latex]W_1 = 2[/latex], [latex]b_1 = 0 [/latex], so [latex]h = f(W_1 \cdot x + b_1) = f(2 \cdot 1 + 0) = f(2) [/latex]
– Assume [latex] f(z) = z^2 [/latex], so [latex] h = 2^2 = 4 [/latex]
– [latex] W_2 = 3 [/latex], [latex]b_2 = 0[/latex], so [latex] y = g(W_2 \cdot h + b_2) = g(3 \cdot 4 + 0) = g(12) [/latex]
– Assume [latex]g(z) = z[/latex], so y = 12
– Suppose the true label is [latex]10[/latex], and the loss function is Mean Squared Error: [latex] L = (y - \text{true\_label})^2 = (12 - 10)^2 = 4 [/latex]

2. Backward Pass (Using Chain Rule):
– Compute the gradient of the loss with respect to [latex] y [/latex]:
[latex] \frac{dL}{dy} = 2 \cdot (y - \text{true\_label}) = 2 \cdot (12 - 10) = 4 [/latex]

– Compute the gradient of [latex] y [/latex] with respect to [latex] W_2 [/latex]:
[latex] \frac{dy}{dW_2} = h = 4 [/latex]

– Using the chain rule, the gradient of the loss with respect to [latex] W_2 [/latex]:
[latex] \frac{dL}{dW_2} = \frac{dL}{dy} \cdot \frac{dy}{dW_2} = 4 \cdot 4 = 16 [/latex]

– Now, compute the gradient of [latex] y [/latex] with respect to [latex] h [/latex]:
[latex] \frac{dy}{dh} = W_2 = 3 [/latex]

– Compute the gradient of [latex] h [/latex] with respect to [latex] W_1 [/latex] (with [latex] f(z) = z^2 [/latex] and [latex] b_1 = 0 [/latex], we have [latex] h = (W_1 \cdot x)^2 [/latex]):
[latex] \frac{dh}{dW_1} = \frac{d}{dW_1} (W_1 \cdot x)^2 = 2 \cdot (W_1 \cdot x) \cdot x = 2 \cdot 2 \cdot 1 = 4 [/latex]

– Using the chain rule, the gradient of the loss with respect to [latex] W_1 [/latex]:
[latex] \frac{dL}{dW_1} = \frac{dL}{dy} \cdot \frac{dy}{dh} \cdot \frac{dh}{dW_1} = 4 \cdot 3 \cdot 4 = 48 [/latex]

Summary

– Forward pass: Calculate the output and the loss.
– Backward pass: Compute the gradients using the chain rule.
– For each layer, calculate the gradient of the loss with respect to its weights by multiplying the gradients of the subsequent layers (starting from the output layer).

Example in Code (PyTorch)

Here’s how you would implement this in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model with one hidden layer
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1, 1)  # W1
        self.fc2 = nn.Linear(1, 1)  # W2

    def forward(self, x):
        h = self.fc1(x)
        h = h ** 2  # Squaring function as activation
        y = self.fc2(h)
        return y

model = SimpleModel()

# Input and target
inputs = torch.tensor([[1.0]])
labels = torch.tensor([[10.0]])

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, labels)

# Backward pass and optimization
optimizer.zero_grad()
loss.backward()

# Gradients
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f'Gradient of {name} is {param.grad}')

# Update weights
optimizer.step()

This code defines a simple model, performs a forward pass to calculate the output and loss, and then uses the backward pass to compute gradients using the chain rule. The gradients are printed to show the calculated values before the weights are updated.
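Because SimpleModel's parameters are randomly initialized, the printed gradients will not match the hand-calculated numbers above. If you want the code to reproduce them, the parameters can be set to the example values right after model = SimpleModel(); this is just an optional check, not part of the original example.

# Optional: fix the parameters to the values used in the worked example
with torch.no_grad():
    model.fc1.weight.fill_(2.0)  # W1 = 2
    model.fc1.bias.fill_(0.0)    # b1 = 0
    model.fc2.weight.fill_(3.0)  # W2 = 3
    model.fc2.bias.fill_(0.0)    # b2 = 0

# With these values, loss.backward() produces
# fc1.weight.grad = 48 and fc2.weight.grad = 16, matching the chain-rule result.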

Stochastic Gradient Descent

SGD stands for Stochastic Gradient Descent. It’s a fundamental optimization algorithm used in training machine learning models, particularly in deep learning.

Here’s a simple breakdown:

1. Gradient Descent: Imagine you’re on a mountain and want to get to the lowest point, which represents the minimum of a function (in deep learning, this function represents the error or loss). Gradient Descent is like taking small steps downhill. At each step, you look at the slope (gradient) of the hill at your current position and take a step in the direction that goes downhill.

2. Stochastic: Instead of computing the gradient over the entire dataset at each step, we estimate it from a randomly chosen sample (or small mini-batch) and take a step based on that estimate. This randomness makes each step much cheaper and can sometimes help escape local minima (points that are low but not the lowest).

3. Training a Neural Network: In deep learning, we have a neural network with lots of parameters (weights and biases). The goal is to find the best values for these parameters that minimize the error between the predicted outputs and the actual outputs. SGD helps us adjust these parameters by computing the gradient of the error with respect to each parameter and updating them in the direction that decreases the error.

4. Example: Let’s say you’re trying to teach a neural network to recognize handwritten digits. You show it a bunch of images of digits along with their labels (e.g., an image of a “3” labeled as “3”). Initially, the neural network makes random guesses about what the digits are. You use SGD to adjust the parameters (weights and biases) of the network based on the errors it makes. For each image, you compute how much the network’s guess differs from the actual label, and you adjust the parameters a little bit to reduce that difference. You do this for many images (possibly going through the dataset multiple times), gradually improving the network’s ability to correctly recognize digits.

So, in essence, SGD is like a guided downhill walk where you take small steps based on random observations to reach the lowest point (minimum error) efficiently.
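To make the update rule concrete, here is a minimal NumPy sketch of stochastic gradient descent on a one-parameter least-squares problem; the data, learning rate, and number of steps are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true relationship: y is roughly 3x (assumption)

w = 0.0    # single parameter to learn
lr = 0.05  # learning rate

for step in range(500):
    i = rng.integers(len(x))             # pick one random sample -> "stochastic"
    grad = 2 * (w * x[i] - y[i]) * x[i]  # d/dw of the squared error for that sample
    w -= lr * grad                       # small step downhill, against the gradient

print(w)   # should end up close to 3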

hyperparameter tuning

There are many different ways to potentially improve a neural network. Some of the most common include: increasing the number of layers (making the network deeper), increasing the number of hidden units (making the network wider), and changing the learning rate. Because these values are all human-changeable, they're referred to as hyperparameters, and the practice of trying to find the best hyperparameters is referred to as hyperparameter tuning.

from https://dev.mrdbourke.com/tensorflow-deep-learning/01_neural_network_regression_in_tensorflow/

An example

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# Note: keras.wrappers.scikit_learn was removed in recent TensorFlow/Keras releases;
# the scikeras package provides a compatible KerasClassifier if this import fails.
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam', init='uniform'):
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
    model.add(Dense(8, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# Load dataset
# Assuming X and y are your features and labels
# For example:
# X = np.array([...])
# y = np.array([...])

# Create model
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define the grid search parameters
param_grid = {
    'batch_size': [10, 20, 40],
    'epochs': [50, 100],
    'optimizer': ['SGD', 'Adam'],
    'init': ['uniform', 'normal']
}

# Create GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

# Fit the model
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, std, param))

Explanation:
1. Model Creation Function: `create_model` defines the structure and compilation of the neural network.
2. KerasClassifier: Wraps the Keras model so it can be used by scikit-learn's `GridSearchCV`.
3. Grid Search Parameters: Defines the hyperparameters and their possible values.
4. GridSearchCV: Performs the grid search over the specified hyperparameters.
5. Fitting the Model: Trains the model on the provided dataset and searches for the best hyperparameters.
6. Results: Prints out the best score and the corresponding hyperparameters, as well as the mean and standard deviation of the scores for each combination.

By using this approach, you can systematically explore a range of hyperparameters to find the combination that yields the best performance for your neural network.

Pytorch

Tensor

Tensors are a specialized data structure that are very similar to arrays and matrices.

Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other 
hardware accelerators. In fact, tensors and NumPy arrays can often share the same 
underlying memory, eliminating the need to copy data (see Bridge with NumPy). 
Tensors are also optimized for automatic differentiation (we’ll see more about that 
later in the Autograd section). If you’re familiar with ndarrays, you’ll be right at 
home with the Tensor API.

import torch
import numpy as np

data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)

np_array = np.array(data)
x_np = torch.from_numpy(np_array)

x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")


Ones Tensor:
 tensor([[1, 1],
         [1, 1]])

Random Tensor:
 tensor([[0.4223, 0.1719],
         [0.3184, 0.2631]])

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Download a dataset (FashionMNIST, as in the PyTorch quickstart tutorial)
# and wrap it in DataLoaders that serve shuffled batches of 64 samples
training_data = datasets.FashionMNIST(root="data", train=True, download=True, transform=ToTensor())
test_data = datasets.FashionMNIST(root="data", train=False, download=True, transform=ToTensor())

train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
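
Once the data is wrapped in DataLoaders, batches can be iterated directly in a training loop; a quick shape check (assuming the FashionMNIST setup above):

for X, y in train_dataloader:
    print(f"Batch of images: {X.shape}")  # torch.Size([64, 1, 28, 28])
    print(f"Batch of labels: {y.shape}")  # torch.Size([64])
    break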

 
