Contents
Learning Sources
deep learning - in general
https://github.com/fchollet/deep-learning-with-python-notebooks
https://github.com/ageron/handson-ml2
Dive into Deep Learning — d2l.ai (1.0.3 documentation)
https://youtu.be/DooxDIRAkPA
https://youtu.be/dafuAz_CV7Q
https://youtu.be/VyWAvY2CF9c
https://youtu.be/WHvWSYKGMDQ
https://www.learnpytorch.io/
https://dev.mrdbourke.com/tensorflow-deep-learning/
https://github.com/mrdbourke/tensorflow-deep-learning/
https://github.com/mrdbourke/pytorch-deep-learning
https://www.youtube.com/channel/UCbfYPyITQ-7l4upoX8nvctg
https://www.youtube.com/c/3blue1brown
https://www.youtube.com/@lexfridman
Neural Networks and Deep Learning Book by Michael Nielsen
For machine learning and deep learning history & new developments, watch this MIT lecture.
By the way, Lex Fridman recommends this site for NLP:
https://github.com/sebastianruder/NLP-progress
keras
https://realpython.com/python-keras-text-classification/
https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
https://keras.io/examples/
https://keras.io/api/applications/
tensorflow keras
https://www.youtube.com/watch?v=B961QM47g64
https://youtu.be/28QbrkRkHlo
https://www.youtube.com/watch?v=Y__gyApx_7c&t=5231s
https://youtu.be/tPYj3fFJGjk
https://youtu.be/VtRLrQ3Ev-U
https://www.youtube.com/watch?v=tpCFfeUEGs8&t=10767s
https://www.youtube.com/watch?v=ZUKz4125WNI
PyTorch
https://www.learnpytorch.io/
https://github.com/mrdbourke/pytorch-deep-learning
https://pytorch.org/tutorials/beginner/basics/intro.html
https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
Sequence model
https://youtu.be/L8HKweZIOmg
RNN
https://youtu.be/6niqTuYFZLQ
https://youtu.be/ySEx_Bqxvvo
https://youtu.be/S7oA5C43Rbc
https://www.ibm.com/topics/recurrent-neural-networks
https://aws.amazon.com/what-is/recurrent-neural-network/
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
LSTM
https://youtu.be/YCzL96nL7j0
transformer
https://huggingface.co/docs/transformers/en/index
https://youtu.be/XfpMkf4rD6E
https://youtu.be/eMlx5fFNoYc
https://blogs.nvidia.com/blog/what-is-a-transformer-model/
https://aws.amazon.com/what-is/transformers-in-artificial-intelligence/
https://www.datacamp.com/tutorial/how-transformers-work
https://www.youtube.com/watch?v=LWMzyfvuehA
https://www.youtube.com/watch?v=XowwKOAWYoQ
https://www.youtube.com/watch?v=bCz4OMemCcA
NLP using deep learning
https://www.youtube.com/watch?v=Hn3GHHOXKCE
https://www.youtube.com/watch?v=Rf7wvs8ZbP4&t=1853s
Warm-ups
Neural networks use examples to automatically infer rules for recognizing patterns.
Perceptrons ( https://images.app.goo.gl/DBBjf95jCtdB4ckG7 ) were developed in the 1950s by Frank Rosenblatt. The neuron’s output, 0 or 1, is determined by whether the weighted sum of its inputs is less than or greater than some threshold value.
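A minimal sketch in Python (my own illustration; the weights and threshold are made-up values) of that rule:

def perceptron(inputs, weights, threshold):
    # Output 1 when the weighted sum exceeds the threshold, else 0
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Example: a 2-input "AND"-like unit
print(perceptron([1, 1], [0.6, 0.6], threshold=1.0))  # 1 (0.6 + 0.6 = 1.2 > 1.0)
print(perceptron([1, 0], [0.6, 0.6], threshold=1.0))  # 0 (0.6 is not > 1.0)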
The best way to improve a deep learning model is to train it on more data or better data.
Previous machine learning techniques—shallow learning—only involved transforming the input data into one or two successive representation spaces, usually via simple transformations such as high-dimensional non-linear projections (SVMs) or decision trees. But the refined representations required by complex problems generally can’t be attained by such techniques. As such, humans had to go to great lengths to make the initial input data more amenable to processing by these methods: they had to manually engineer good layers of representations for their data. This is called feature engineering. Deep learning, on the other hand, completely automates this step: with deep learning, you learn all features in one pass rather than having to engineer them yourself. This has greatly simplified machine learning workflows, often replacing sophisticated multistage pipelines with a single, simple, end-to-end deep learning model.(From Chollet, 2017, Ch1)
What happened is that the gaming market subsidized supercomputing for the next generation of artificial intelligence applications. Sometimes, big things begin as games. Today, the NVIDIA Titan RTX, a GPU that cost $2,500 at the end of 2019, can deliver a peak of 16 teraFLOPS in single precision (16 trillion float32 operations per second). That’s about 500 times more computing power than the world’s fastest supercomputer from 1990, the Intel Touchstone Delta. On a Titan RTX, it takes only a few hours to train an ImageNet model of the sort that would have won the ILSVRC competition around 2012 or 2013. Meanwhile, large companies train deep learning models on clusters of hundreds of GPUs. (From Chollet, 2017, Ch1)
The most popular NVIDIA GPUs for deep learning as of 2024 are the NVIDIA GeForce RTX 3090 and the RTX 4090, depending on the specific requirements and budget considerations.
As of 2024, the prices for the NVIDIA GPUs popular for deep learning vary significantly: the NVIDIA GeForce RTX 3090 is priced around $1,114, but you can find it for slightly less depending on sales and availability (Tom’s Hardware). The NVIDIA RTX 4090 is generally more expensive, costing between $1,650 and $2,178 (Tom’s Hardware).
Scalability—Deep learning is highly amenable to parallelization on GPUs or TPUs, so it can take full advantage of Moore’s law. In addition, deep learning models are trained by iterating over small batches of data, allowing them to be trained on datasets of arbitrary size. (The only bottleneck is the amount of parallel computational power available, which, thanks to Moore’s law, is a fast-moving barrier.) (From Chollet, 2017, Ch1)
Picking the right network architecture is more an art than a science; although there are some best practices and principles you can rely on, only practice can help you become a proper neural-network architect. The three most common use cases of neural networks are binary classification, multiclass classification, and scalar regression.
At its core, a tensor is a container for data—almost always numerical data. So, it’s a container for numbers. You may be already familiar with matrices, which are 2D tensors: tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a dimension is often called an axis). (From Chollet, 2017, Ch2)
The test-set accuracy turns out to be 97.8%—that’s quite a bit lower than the training set accuracy. This gap between training accuracy and test accuracy is an example of overfitting: the fact that machine-learning models tend to perform worse on new data than on their training data. (From Chollet, 2017, Ch2)
Common tensor shapes for data (From Chollet, 2017, Ch2):
- Vector data: 2D tensors of shape (samples, features)
- Timeseries data or sequence data: 3D tensors of shape (samples, timesteps, features)
- Images: 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
- Video: 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)

The core ingredients of training a neural network (From Chollet, 2021, Ch3):
- Tensors, including special tensors that store the network’s state (variables)
- Tensor operations such as addition, relu, matmul
- Backpropagation, a way to compute the gradient of mathematical expressions (handled in TensorFlow via the GradientTape object)
- Layers, which are combined into a model
- A loss function, which defines the feedback signal used for learning
- An optimizer, which determines how learning proceeds
- Metrics to evaluate model performance, such as accuracy
- A training loop that performs mini-batch stochastic gradient descent
Why a non-linear activation function? Without non-linear activation functions, the entire neural network would behave like a single-layer perceptron, regardless of its depth, because a composition of linear functions is still a linear function. Non-linear activation functions allow neural networks to learn complex, non-linear relationships between inputs and outputs, which is crucial for modeling real-world data that is inherently non-linear. (An affine transformation is a linear mapping method that preserves points, straight lines, and planes.)

Why do we need an optimizer in deep learning? An optimizer determines how the network will be updated based on the loss function; it implements a specific variant of stochastic gradient descent (SGD). Different optimizers converge at different speeds for different kinds of data and models; for example, Adam is known for converging quickly in many scenarios compared to plain SGD. The primary role of an optimizer is to minimize the loss function, which measures the difference between the model’s predictions and the actual values; by minimizing the loss, the optimizer improves the model’s accuracy and performance. Without a well-chosen optimizer, the training process might not converge, meaning the model might never reach a point where the loss is adequately minimized; the optimizer guides the learning process so that it converges to a good solution within a reasonable time.
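A quick way to see why the non-linearity matters: stacking two Dense layers with no activation is mathematically equivalent to a single linear layer. The small sketch below (my own illustration, with random weights) compares a two-layer linear stack against the single affine map formed by composing the weights.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))        # 5 samples, 4 features

# Two "layers" with no activation function
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
two_linear_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping collapsed into one affine transformation
W = W1 @ W2
b = b1 @ W2 + b2
one_linear_layer = x @ W + b

print(np.allclose(two_linear_layers, one_linear_layer))  # True: the extra depth added no expressive power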
What is the difference between an optimizer and an objective function?
Optimizer and objective function are two related but distinct concepts in machine learning and optimization.
Objective function (also called loss function or cost function):
- Defines the goal of the optimization problem
- Measures the difference between the model's predictions and the actual true labels
- Examples: Mean Squared Error (MSE), Cross-Entropy, Mean Absolute Error (MAE)
Optimizer:
- An algorithm that searches for the optimal parameters of a model
- Adjusts the model's parameters to minimize the objective function
- Examples: Stochastic Gradient Descent (SGD), Adam, RMSprop, Gradient Descent (GD)
In summary:
- The objective function defines what we want to optimize (e.g., minimize the error)
- The optimizer is the algorithm that performs the optimization (e.g., adjusts the model's parameters to minimize the error)
For instance, you’ll use binary cross-entropy for a two-class classification problem, categorical cross-entropy for a many-class classification problem, mean squared error for a regression problem, connectionist temporal classification (CTC) for a sequence-learning problem, and so on. Only when you’re working on truly new research problems will you have to develop your own objective functions.
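In Keras, this choice shows up as the loss argument to model.compile. A small sketch (the layer sizes are arbitrary placeholders) pairing each task with a typical loss and a matching output layer:

from tensorflow import keras
from tensorflow.keras import layers

# Binary classification: 1 sigmoid output + binary cross-entropy
binary_model = keras.Sequential([layers.Dense(16, activation="relu"),
                                 layers.Dense(1, activation="sigmoid")])
binary_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multiclass classification: N softmax outputs + categorical cross-entropy
multiclass_model = keras.Sequential([layers.Dense(16, activation="relu"),
                                     layers.Dense(10, activation="softmax")])
multiclass_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Scalar regression: 1 linear output + mean squared error
regression_model = keras.Sequential([layers.Dense(16, activation="relu"),
                                     layers.Dense(1)])
regression_model.compile(optimizer="adam", loss="mse", metrics=["mae"])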
What is learning rate decay? In deep learning, learning rate decay (sometimes just called learning decay) refers to the technique of reducing the learning rate over time during training. The learning rate is a crucial hyperparameter that determines the size of the steps that the optimizer takes towards the minimum of the loss function. By adjusting the learning rate throughout the training process, learning rate decay aims to achieve more effective and reliable training results.
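For example, Keras optimizers accept a learning-rate schedule in place of a fixed number. A minimal sketch (the decay values here are arbitrary) using the built-in ExponentialDecay schedule:

from tensorflow import keras

# Start at 1e-3 and multiply the learning rate by 0.5 every 1,000 optimizer steps
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1000,
    decay_rate=0.5,
    staircase=True)

optimizer = keras.optimizers.RMSprop(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])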
Choosing the right activation function, such as sigmoid, softmax, or others, depends on your network’s architecture and the nature of the problem you’re trying to solve. Here’s a breakdown of when to use each:

Sigmoid
- Range: outputs values between 0 and 1.
- Use cases: the output layer for binary classification (it gives a probability-like output); occasionally hidden layers, though this is now rare because of vanishing gradients.
- Pros: smooth gradient; outputs can be interpreted as probabilities.
- Cons: vanishing gradient problem, especially for deep networks; outputs are not zero-centered, which can slow down convergence.

Softmax
- Range: outputs a probability distribution over multiple classes, where the probabilities sum to 1.
- Use cases: the output layer for multiclass classification, or any final layer where a probabilistic interpretation across multiple categories is needed.
- Pros: clear probabilistic interpretation; suitable for mutually exclusive classes.
- Cons: computationally more expensive than simpler functions; less interpretable when classes are not mutually exclusive.

ReLU (Rectified Linear Unit)
- Range: outputs values from 0 to infinity.
- Use cases: very popular for hidden layers in deep learning models due to its simplicity and effectiveness.
- Pros: reduces the likelihood of the vanishing gradient problem; computationally efficient.
- Cons: can cause dead neurons (units that output zero for all inputs).

Tanh (Hyperbolic Tangent)
- Range: outputs values between -1 and 1.
- Use cases: sometimes used in hidden layers where zero-centered outputs are desired.
- Pros: zero-centered, which can help with convergence; strong gradients for inputs in the range [-1, 1].
- Cons: still susceptible to the vanishing gradient problem.

Leaky ReLU
- Range: outputs values from -infinity to infinity.
- Use cases: similar to ReLU, but addresses the dead-neuron problem by allowing a small, non-zero gradient when the unit is not active.
- Pros: helps mitigate the dead-neuron issue.
- Cons: requires tuning of the negative slope parameter.

Choosing the right activation function
1. Task nature: sigmoid for binary classification, softmax for multiclass classification; for hidden layers, ReLU or Leaky ReLU are generally good starting points.
2. Network depth: for deep networks, prefer activation functions that mitigate vanishing gradients, like ReLU and its variants.
3. Output interpretation: for probabilistic outputs, use sigmoid or softmax.
4. Experimentation: the best choice can be task-specific, requiring empirical testing on your dataset and model architecture.

By considering these factors, you can make an informed decision on which activation function to use in your neural network models.
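The functions above are easy to inspect directly in TensorFlow; a quick sketch (the input values are chosen arbitrarily) evaluating sigmoid, ReLU, tanh, and softmax:

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print("sigmoid:", tf.keras.activations.sigmoid(x).numpy())  # squashes values into (0, 1)
print("relu:   ", tf.keras.activations.relu(x).numpy())     # zeroes out the negatives
print("tanh:   ", tf.keras.activations.tanh(x).numpy())     # squashes values into (-1, 1)
# softmax expects at least a 2D (batch, classes) tensor; the result sums to 1
print("softmax:", tf.keras.activations.softmax(tf.reshape(x, (1, -1))).numpy())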
Regularization in deep learning is a technique used to prevent overfitting, which occurs when a model learns the training data too well and performs poorly on new, unseen data. Regularization helps the model generalize better by adding a penalty to the loss function, discouraging the model from becoming too complex. A few common regularization techniques:

1. L1 and L2 regularization: add a penalty term to the loss function based on the size of the weights. L1 regularization (lasso) adds the absolute values of the weights to the loss function; L2 regularization (ridge) adds the squared values of the weights.
2. Dropout: randomly sets a fraction of the input units to 0 at each update during training, which helps prevent the model from becoming too reliant on specific neurons.
3. Early stopping: monitors the model's performance on a validation set and stops training when performance no longer improves, preventing the model from overfitting the training data.
4. Data augmentation: increases the diversity of the training data by applying random transformations like rotation, scaling, and flipping.

Example of L2 regularization in Keras:

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2
from keras.datasets import mnist
from keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define the model
model = Sequential([
    Dense(512, activation='relu', input_shape=(28 * 28,), kernel_regularizer=l2(0.001)),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels, epochs=10, batch_size=128, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

In this example:
- We use the MNIST dataset, a standard dataset for digit recognition.
- The model has one hidden layer with 512 units and L2 regularization applied to its weights with a regularization factor of 0.001.
- The `kernel_regularizer=l2(0.001)` argument adds the L2 penalty to the loss function, helping to prevent overfitting.
- The model is trained for 10 epochs with a batch size of 128, and 20% of the training data is used for validation (`validation_split=0.2`).

Regularization helps the model not only fit the training data but also generalize well to new, unseen data.
In machine learning, parameters and hyperparameters are crucial concepts, but they refer to different aspects of the model.

Parameters
Parameters are internal variables of the model that are learned from the training data. They are the values that the learning algorithm adjusts during training to minimize the loss function. For example, in a neural network, the weights and biases of the neurons are parameters. Parameters are updated through optimization techniques like gradient descent.
Examples:
- In a linear regression model [latex] y = wx + b [/latex], [latex] w [/latex] (weight) and [latex] b [/latex] (bias) are parameters.
- In a neural network, the weights and biases of the layers are parameters.

Hyperparameters
Hyperparameters are external to the model and set before the learning process begins. They control the training process and the model architecture. Hyperparameters are not learned from the training data but are often set through experimentation and tuning to find the best performance.
Examples:
- Learning rate: determines the step size during gradient descent.
- Number of epochs: the number of times the learning algorithm will work through the entire training dataset.
- Batch size: the number of training examples utilized in one iteration.
- Number of layers and units in each layer of a neural network.
- Regularization parameters like [latex] \lambda [/latex] in L2 regularization.

Key differences
- Learning process: parameters are learned from the data, while hyperparameters are set before training.
- Role: parameters define the model's final configuration, whereas hyperparameters guide the learning process and model structure.
- Adjustment: parameters are adjusted by the training algorithm, while hyperparameters are typically adjusted through techniques like grid search or random search.

Example in the context of neural networks:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.datasets import mnist
from keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28)).astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define the model
model = Sequential([
    Dense(512, activation='relu', input_shape=(28 * 28,)),
    Dense(10, activation='softmax')
])

# Define hyperparameters
learning_rate = 0.001
batch_size = 128
epochs = 10

# Compile the model with the Adam optimizer (setting the learning rate)
model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

In this example:
- Parameters: the weights and biases of the layers in the model, which are learned during training.
- Hyperparameters: the learning rate (`learning_rate = 0.001`), the number of epochs (`epochs = 10`), and the batch size (`batch_size = 128`).

Hyperparameters are set before training starts, and they influence how the parameters are adjusted during the training process.
As of now, both PyTorch and TensorFlow are highly popular and widely used frameworks for deep learning, each with its own strengths and community support. Some trends can be observed:

1. PyTorch:
- Adoption in research: PyTorch is particularly favored in the research community due to its dynamic computation graph, which makes it easier to debug and develop new models.
- Ease of use: it is often considered more Pythonic and intuitive, which makes it easier for beginners and researchers to work with.
- Growing industry adoption: while it started with a strong presence in academia, PyTorch is increasingly being adopted in industry settings as well.

2. TensorFlow:
- Industry standard: TensorFlow has been widely adopted in industry, especially for production and deployment. It provides robust tools for scalability and deployment, including TensorFlow Serving and TensorFlow Lite for mobile and embedded devices.
- Comprehensive ecosystem: TensorFlow Extended (TFX) for end-to-end machine learning pipelines, TensorFlow.js for running models in the browser, and TensorFlow Hub for sharing pre-trained models.
- TensorFlow 2.0: the release of TensorFlow 2.0 made the framework more user-friendly by integrating Keras as its high-level API, which has narrowed the gap in ease of use compared to PyTorch.

Overall, the choice between PyTorch and TensorFlow often comes down to specific use cases and personal or team preferences. Researchers may prefer PyTorch for its flexibility, while companies looking to deploy models at scale might lean towards TensorFlow for its production-ready capabilities. Both frameworks continue to evolve rapidly, with active development and new features added regularly.
deep learning math
Let’s break down the key mathematical concepts in deep learning, with equations and explanations:
1. Linear Algebra:
– Scalars, Vectors, Matrices, and Tensors:
– Scalars: Single numerical value (e.g., [latex]a[/latex]).
– Vectors: Ordered array of scalars (e.g., [latex]\mathbf{v} = [v_1, v_2, \ldots, v_n][/latex]).
– Matrices: 2D array of scalars (e.g., [latex]A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}[/latex]).
– Tensors: Generalization of matrices to higher dimensions.
– Matrix Multiplication:
– [latex]C = AB[/latex] where [latex]C_{ij} = \sum_k A_{ik}B_{kj}[/latex].
– Transpose of a Matrix:
– [latex]A^T[/latex] where [latex]A^T_{ij} = A_{ji}[/latex].
– Dot Product:
– [latex]c = \mathbf{v}_1 \cdot \mathbf{v}_2 = \sum_{i} v_{1i}v_{2i}[/latex].
– Hadamard Product:
– [latex]C = A \odot B[/latex] where [latex]C_{ij} = A_{ij} \times B_{ij}[/latex].
2. Calculus:
– Derivatives:
– The derivative of a function [latex]f(x)[/latex] with respect to [latex]x[/latex], denoted as [latex]f'(x)[/latex], represents the rate of change of [latex]f[/latex] at point [latex]x[/latex].
– Chain Rule:
– If [latex]f(x) = g(h(x))[/latex], then [latex]f'(x) = g'(h(x)) \cdot h'(x)[/latex].
– Gradient Descent:
– Update rule: [latex]x_{t+1} = x_t - \alpha \nabla f(x_t)[/latex], where [latex]f(x_t)[/latex] is the objective function, [latex]\nabla f(x_t)[/latex] is its gradient, and [latex]\alpha[/latex] is the learning rate.
– Partial Derivatives and Gradients:
– If [latex] f(x_1, x_2, \ldots, x_n) [/latex], the gradient is [latex]\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right][/latex].
3. Probability and Statistics:
– Probability Distributions:
– Gaussian (Normal), Bernoulli, etc.
– Expectation and Variance:
– Expectation: [latex]E[X] = \sum_x x P(X=x)[/latex].
– Variance: [latex]Var(X) = E[(X - \mu)^2][/latex].
– Maximum Likelihood Estimation (MLE):
– Estimating parameters that maximize the likelihood of observing the data.
– Bayesian Inference and Bayes’ Theorem:
– [latex]P(A|B) = \frac{P(B|A)P(A)}{P(B)}[/latex].
4. Optimization:
– Gradient Descent:
– Update rule: [latex]x_{t+1} = x_t - \alpha \nabla f(x_t)[/latex].
– Stochastic Gradient Descent (SGD):
– Mini-batch update: [latex]x_{t+1} = x_t - \alpha \nabla f(x_t; \mathcal{D}_t)[/latex], where [latex]\mathcal{D}_t[/latex] is a random mini-batch.
– Adam, RMSProp, etc.:
– Advanced optimization algorithms with adaptive learning rates.
5. Neural Networks:
– Activation Functions:
– Sigmoid: [latex]f(x) = \frac{1}{1 + e^{-x}}[/latex].
– ReLU: [latex]f(x) = \max(0, x)[/latex].
– Feedforward Propagation:
– [latex]z = Wx + b[/latex], [latex]a = \text{activation}(z)[/latex].
– Backpropagation:
– Compute gradients of the loss with respect to weights using the chain rule.
– CNNs, RNNs, LSTMs, GRUs:
– Architectures for handling specific types of data and learning tasks.
6. Loss Functions:
– Mean Squared Error (MSE):
– [latex]L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2[/latex].
– Cross-Entropy Loss:
– [latex]L(y, \hat{y}) = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)[/latex].
7. Regularization:
– L1 and L2 Regularization:
– Add penalty terms to the loss function: [latex]L_{\text{regularized}} = L_{\text{original}} + \lambda \| \theta \|_p[/latex].
– Dropout:
– Randomly drop units during training to prevent overfitting.
– Batch Normalization:
– Normalize inputs of each layer to speed up training and reduce overfitting.
Understanding these equations and concepts provides a solid foundation for diving deeper into the mathematics behind deep learning.
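Several of these pieces can be checked numerically in a few lines. A toy sketch (my own example, not from the sources above) that takes one gradient-descent step on the MSE loss of a one-parameter linear model and verifies the analytic gradient with a finite-difference estimate:

import numpy as np

# Toy data and a one-parameter model: y_hat = w * x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # the true relationship is y = 2x
w = 0.0

def mse(w):
    return np.mean((y - w * x) ** 2)

# Analytic gradient of the MSE with respect to w: dL/dw = -2 * mean(x * (y - w*x))
grad = -2.0 * np.mean(x * (y - w * x))

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = (mse(w + eps) - mse(w - eps)) / (2 * eps)
print(grad, grad_numeric)            # both approximately -18.67

# One gradient-descent update: w := w - eta * dL/dw
eta = 0.05
w_new = w - eta * grad
print(mse(w), mse(w_new))            # the loss decreases after the step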
tensorflow & keras
tensor
In TensorFlow, a tensor is a multi-dimensional array used to represent data. Tensors are the primary data structure in TensorFlow, and they are used for both input data and the parameters of machine learning models. They can have various dimensions, known as ranks, such as:
- Rank 0 Tensor: A scalar (e.g., 5).
- Rank 1 Tensor: A vector (e.g., [1, 2, 3]).
- Rank 2 Tensor: A matrix (e.g., [[1, 2], [3, 4]]).
- Higher Rank Tensors: Arrays with three or more dimensions.
In TensorFlow, tensors have several key attributes that define their properties and how they can be used in computations. Here are the key attributes of a tensor:
1. Rank
The rank of a tensor refers to the number of dimensions it has. For example:
– A scalar has a rank of 0.
– A vector has a rank of 1.
– A matrix has a rank of 2.
– Higher-dimensional arrays have ranks 3 and above.
2. Shape
The shape of a tensor is a tuple that describes the size of each dimension. For example:
– A scalar has an empty shape `()`.
– A vector with 5 elements has a shape `(5,)`.
– A matrix with 3 rows and 4 columns has a shape `(3, 4)`.
– A 3-dimensional tensor with dimensions 2, 3, and 4 has a shape `(2, 3, 4)`.
3. Data Type (dtype)
The data type of a tensor specifies the type of values it holds, such as:
– `tf.float32`: 32-bit floating point.
– `tf.int32`: 32-bit integer.
– `tf.bool`: Boolean values.
4. Device
The device attribute specifies the hardware device where the tensor is stored and on which computations are performed, such as:
– CPU: `/device:CPU:0`
– GPU: `/device:GPU:0`
Example in TensorFlow (I ran this on deepnote.com):
import tensorflow as tf

tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]], dtype=tf.float32)
print(tensor)
print("\nRank:", tf.rank(tensor).numpy())
print("Shape:", tensor.shape)
print("Data Type:", tensor.dtype)
print("Device:", tensor.device)

Output:
tf.Tensor(
[[1. 2.]
 [3. 4.]], shape=(2, 2), dtype=float32)

Rank: 2
Shape: (2, 2)
Data Type: <dtype: 'float32'>
Device: /job:localhost/replica:0/task:0/device:CPU:0
import tensorflow as tf

# Creating Tensors
tensor_a = tf.constant([[1, 2, 3], [4, 5, 6]])
tensor_b = tf.constant([[7, 8, 9], [10, 11, 12]])

# Basic Tensor Operations
# Addition
tensor_add = tf.add(tensor_a, tensor_b)
print("Addition:\n", tensor_add.numpy())

# Subtraction
tensor_sub = tf.subtract(tensor_a, tensor_b)
print("Subtraction:\n", tensor_sub.numpy())

# Element-wise Multiplication
tensor_mul = tf.multiply(tensor_a, tensor_b)
print("Element-wise Multiplication:\n", tensor_mul.numpy())

# Matrix Multiplication
tensor_matmul = tf.matmul(tensor_a, tensor_b, transpose_b=True)
print("Matrix Multiplication:\n", tensor_matmul.numpy())

# Division
tensor_div = tf.divide(tensor_a, tensor_b)
print("Division:\n", tensor_div.numpy())

# Creating Tensors with different data types
tensor_float = tf.constant([[1.1, 2.2], [3.3, 4.4]], dtype=tf.float32)
tensor_int = tf.constant([[1, 2], [3, 4]], dtype=tf.int32)

# Reshaping Tensors
tensor_reshaped = tf.reshape(tensor_a, [3, 2])
print("Reshaped Tensor:\n", tensor_reshaped.numpy())

# Transposing Tensors
tensor_transposed = tf.transpose(tensor_a)
print("Transposed Tensor:\n", tensor_transposed.numpy())

# Reducing dimensions
tensor_sum = tf.reduce_sum(tensor_a)
print("Sum of all elements:\n", tensor_sum.numpy())

tensor_max = tf.reduce_max(tensor_a)
print("Maximum element:\n", tensor_max.numpy())

# Broadcasting
tensor_c = tf.constant([1, 2, 3])
tensor_broadcasted_add = tf.add(tensor_a, tensor_c)
print("Broadcasted Addition:\n", tensor_broadcasted_add.numpy())

# Applying functions element-wise
tensor_squared = tf.square(tensor_a)
print("Element-wise Squaring:\n", tensor_squared.numpy())

# Creating a Tensor from NumPy array
import numpy as np
np_array = np.array([[1, 2, 3], [4, 5, 6]])
tensor_from_np = tf.convert_to_tensor(np_array, dtype=tf.int32)
print("Tensor from NumPy array:\n", tensor_from_np.numpy())

Output:
Addition:
 [[ 8 10 12]
 [14 16 18]]
Subtraction:
 [[-6 -6 -6]
 [-6 -6 -6]]
Element-wise Multiplication:
 [[ 7 16 27]
 [40 55 72]]
Matrix Multiplication:
 [[ 50  68]
 [122 167]]
Division:
 [[0.14285714 0.25       0.33333333]
 [0.4        0.45454545 0.5       ]]
Reshaped Tensor:
 [[1 2]
 [3 4]
 [5 6]]
Transposed Tensor:
 [[1 4]
 [2 5]
 [3 6]]
Sum of all elements:
 21
Maximum element:
 6
Broadcasted Addition:
 [[2 4 6]
 [5 7 9]]
Element-wise Squaring:
 [[ 1  4  9]
 [16 25 36]]
Tensor from NumPy array:
 [[1 2 3]
 [4 5 6]]
string = tf.Variable("this is a string", tf.string)
number = tf.Variable(324, tf.int16)
floating = tf.Variable(3.567, tf.float64)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import MaxPooling1D
keras examples
From the TensorFlow website: https://www.tensorflow.org/tutorials
# The example from the famous book by Chollet.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist

# Load the MNIST data that the rest of the example assumes
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

model = keras.Sequential([
    layers.Dense(512, activation="relu"),   # fully connected layer
    layers.Dense(10, activation="softmax")  # 10-way softmax classification layer, returning an array of 10 probability scores
])

model.compile(optimizer="rmsprop",                     # how the model will update itself based on the training data it sees
              loss="sparse_categorical_crossentropy",  # how the model measures its performance on the training data
              metrics=["accuracy"])                    # the fraction of the images that were correctly classified

# Preprocess the data: reshape it into the shape the model expects and scale it so that all values are in the [0, 1] interval
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

model.fit(train_images, train_labels, epochs=5, batch_size=128)

>>> test_digits = test_images[0:10]
>>> predictions = model.predict(test_digits)
>>> predictions[0]
array([1.0726176e-10, 1.6918376e-10, 6.1314843e-08, 8.4106023e-06,
       2.9967067e-11, 3.0331331e-09, 8.3651971e-14, 9.9999106e-01,
       2.6657624e-08, 3.8127661e-07], dtype=float32)
# This first test digit has the highest probability score (0.99999106, almost 1) at index 7, so according to our model, it must be a 7:
>>> predictions[0].argmax()
7
>>> predictions[0][7]
0.99999106
>>> test_labels[0]
7
>>> test_loss, test_acc = model.evaluate(test_images, test_labels)
>>> print(f"test_acc: {test_acc}")
test_acc: 0.9785

Source: Chollet 2022, Deep Learning with Python, Second Edition, Ch 2
# A simple example of Keras by Chat GPT

- Importing modules: `from keras.models import Sequential` imports the Sequential model API from Keras, which is useful for creating a linear stack of layers (simple models). `from keras.layers import Dense` imports the Dense layer, which is a fully connected neural network layer.
- Model initialization: `model = Sequential()` creates an instance of a Sequential model. This will be the container into which we add our layers.
- Adding layers: `model.add(Dense(12, activation='relu', input_shape=(n_features,)))` adds a fully connected (Dense) layer with 12 neurons. The `activation='relu'` argument specifies the Rectified Linear Unit activation function. The `input_shape` should match the number of features in your dataset (excluding the target variable).
- Adding a hidden layer: `model.add(Dense(8, activation='relu'))` adds another dense layer with 8 neurons, also with ReLU activation. This is the hidden layer.
- Adding the output layer: `model.add(Dense(1, activation='sigmoid'))` adds the output layer with a single neuron because this is binary classification. `activation='sigmoid'` is used because it outputs a probability between 0 and 1, which is ideal for binary classification.
- Compiling the model: `model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])` compiles the model for training. `loss='binary_crossentropy'` is the loss function commonly used for binary classification, `optimizer='adam'` is an efficient gradient descent algorithm, and we choose to monitor `accuracy` during training.
- Fitting the model: `model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=1)` fits the model on the training data. `epochs=50` means the entire dataset is passed forward and backward through the neural network 50 times, `batch_size=1` indicates we are using stochastic gradient descent (one sample per gradient update), and `verbose=1` shows the training progress.
- Evaluating the model: `model.evaluate(X_test, y_test, verbose=0)` evaluates the model on the testing set quietly (`verbose=0`) and returns loss and accuracy.
- Printing accuracy: `print('Accuracy: %.2f' % accuracy)` prints the accuracy of the model after evaluation.
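The listing being described is not included above; assembled from those bullet points, it would look roughly like this (the data here is randomly generated filler so the script runs end to end; substitute your own dataset and n_features):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Placeholder data so the script runs end to end; replace with your own dataset
n_features = 8
rng = np.random.default_rng(42)
X = rng.random((200, n_features)).astype('float32')
y = (X.sum(axis=1) > n_features / 2).astype('float32')   # made-up binary target
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

model = Sequential()
model.add(Dense(12, activation='relu', input_shape=(n_features,)))  # first hidden layer
model.add(Dense(8, activation='relu'))                              # second hidden layer
model.add(Dense(1, activation='sigmoid'))                           # output layer for binary classification

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=1)

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy: %.2f' % accuracy)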
# Another example
This code is very basic and meant for educational purposes. In a real-world scenario, you’d need to preprocess your data, possibly scale it, and handle other aspects like model validation and hyperparameter tuning.

- Import libraries: import the necessary modules for data manipulation, model creation, and evaluation.
- Load dataset: the Iris dataset is loaded from scikit-learn. It contains 4 features and 3 classes of iris species.
- Preprocess data:
  - StandardScaler: scales the features to have zero mean and unit variance, which helps neural networks converge faster.
  - OneHotEncoder: since this is a multi-class classification problem (3 classes), the target variable y is one-hot encoded, converting it into a format suitable for softmax classification.
  - train_test_split: splits the dataset into training and testing sets, 70% training and 30% testing.
- Build the neural network model: a Sequential model is created, followed by two hidden layers with 10 neurons each and ReLU activation. The output layer has 3 neurons (one for each class) with softmax activation, which is used for multi-class classification.
- Compile the model: the model is compiled with the Adam optimizer, the categorical cross-entropy loss function (suitable for multi-class classification problems), and accuracy as a metric to track during training.
- Train the model: the model is trained for 100 epochs with a batch size of 5. This means the model will see the training data 100 times and update the weights in batches of 5 samples each.
- Evaluate the model: finally, the model’s performance is evaluated on the test set; loss and accuracy are printed out. Lower loss and higher accuracy indicate better model performance.
When you run this code, it will display the accuracy and loss after training and testing, giving you insights into how well the model has learned to classify the iris species. You can adjust the number of epochs, batch size, and model architecture to see how these changes affect model performance.
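The Iris code being described is not reproduced here; a minimal sketch consistent with the description (layer sizes, the 70/30 split, 100 epochs, and batch size 5 follow the bullets above, everything else is my assumption) could be:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load and preprocess the Iris dataset
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)                 # zero mean, unit variance
# One-hot encode the 3-class target (use sparse=False on older scikit-learn versions)
y = OneHotEncoder(sparse_output=False).fit_transform(iris.target.reshape(-1, 1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Two hidden layers of 10 ReLU units, softmax output over the 3 classes
model = Sequential([
    Dense(10, activation='relu', input_shape=(4,)),
    Dense(10, activation='relu'),
    Dense(3, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=100, batch_size=5, verbose=0)
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Test loss: {loss:.3f}, test accuracy: {accuracy:.3f}')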
### Example

# Feature normalization (or scaling)
normalized_feature = keras.utils.normalize(X.values)

# Import train_test_split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Split up the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

# Build the network
from tensorflow import keras
from keras.models import Sequential
# from tensorflow.keras.models import Sequential
from keras.layers import Dense

## Build the model (a three-layer network with one hidden layer)
model = Sequential()
model.add(Dense(4, input_dim=4, activation='relu'))  # You don't have to specify input size. Just define the hidden layers
model.add(Dense(3, activation='relu'))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mse'])

# Fit the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=32)

# Inspect the model
model.summary()
model.evaluate(X_test, y_test)[1]

# Predict SALES using the test data
test_predictions = model.predict(X_test).flatten()
### Example

# Create a scaler model that is fit on the input data.
scaler = StandardScaler().fit(X_data)

# Scale the numeric feature variables
X_data = scaler.transform(X_data)

# Convert the target variable to a one-hot-encoded array
Y_data = tf.keras.utils.to_categorical(Y_data, 3)

# Split training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X_data, Y_data, test_size=0.10)

from tensorflow import keras

# Number of classes in the target variable
NB_CLASSES = 3

# Create a sequential model in Keras
model = tf.keras.models.Sequential()

# Add the first hidden layer
model.add(keras.layers.Dense(128,                    # number of nodes
                             input_shape=(4,),       # number of input variables
                             name='Hidden-Layer-1',  # logical name
                             activation='relu'))     # activation function

# Add a second hidden layer
model.add(keras.layers.Dense(128,
                             name='Hidden-Layer-2',
                             activation='relu'))

# Add an output layer with softmax activation
model.add(keras.layers.Dense(NB_CLASSES,
                             name='Output-Layer',
                             activation='softmax'))

# Compile the model with loss & metrics
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])

# Print the model meta-data
model.summary()

# Make it verbose so we can see the progress
VERBOSE = 1

# Set up hyperparameters for training
# Set batch size
BATCH_SIZE = 16
# Set number of epochs
EPOCHS = 10
# Set validation split. 20% of the training data will be used for validation after each epoch
VALIDATION_SPLIT = 0.2

print("\nTraining Progress:\n------------------------------------")

# Fit the model. This will perform the entire training cycle, including
# forward propagation, loss computation, backward propagation and gradient descent.
# Execute for the specified batch size and number of epochs.
# Perform validation after each epoch.
history = model.fit(X_train,
                    Y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    verbose=VERBOSE,
                    validation_split=VALIDATION_SPLIT)

print("\nAccuracy during Training :\n------------------------------------")
import matplotlib.pyplot as plt

# Plot accuracy of the model after each epoch.
pd.DataFrame(history.history)["accuracy"].plot(figsize=(8, 5))
plt.title("Accuracy improvements with Epoch")
plt.show()

# Evaluate the model against the test dataset and print results
print("\nEvaluation against Test Dataset :\n------------------------------------")
model.evaluate(X_test, Y_test)
### Example
https://elitedatascience.com/keras-tutorial-deep-learning-in-python

# 3. Import libraries and modules
import numpy as np
np.random.seed(123)  # for reproducibility

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist

# 4. Load pre-shuffled MNIST data into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# 5. Preprocess input data
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# 6. Preprocess class labels
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

# 7. Define model architecture
model = Sequential()
model.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Convolution2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# 8. Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# 9. Fit model on training data
model.fit(X_train, Y_train, batch_size=32, epochs=10, verbose=1)

# 10. Evaluate model on test data
score = model.evaluate(X_test, Y_test, verbose=0)
batches
In deep learning, data batching is a technique used to improve the efficiency and performance of training models. Instead of processing the entire dataset at once, the dataset is divided into smaller chunks called batches. Each batch is processed independently through the neural network during training. This approach has several benefits, including reducing memory usage and enabling faster and more stable convergence of the model.
Concept of Data Batches
1. Batch Size:
– The batch size is the number of samples processed before the model’s internal parameters (weights and biases) are updated.
– Common batch sizes are powers of 2, such as 32, 64, 128, etc.
2. Epoch:
– An epoch is one complete pass through the entire training dataset.
– If you have a dataset with 1,000 samples and a batch size of 100, it will take 10 batches to complete one epoch.
3. Iterations:
– An iteration refers to one update of the model’s parameters. It corresponds to processing one batch of data.
– If you have a dataset with 1,000 samples, a batch size of 100, and you train for 10 epochs, you will have 100 iterations (10 epochs * 10 batches per epoch).
Benefits of Using Data Batches
1. Memory Efficiency:
– Processing the entire dataset at once (batch size = dataset size) might not fit into memory, especially with large datasets. Batching allows training on smaller chunks that fit in memory.
2. Computational Efficiency:
– Modern hardware, such as GPUs, are optimized for batch processing. Using batches can lead to more efficient use of computational resources.
3. Stable Training:
– Batching helps to smooth out the gradient updates, which can lead to more stable training and better convergence.
Example of Data Batching in Practice
Let’s consider a simple example using TensorFlow and the MNIST dataset, which consists of 60,000 training images of handwritten digits.
import tensorflow as tf

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the data
x_train, x_test = x_train / 255.0, x_test / 255.0

# Create a simple model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model with a batch size of 32
model.fit(x_train, y_train, epochs=5, batch_size=32)

# Evaluate the model
model.evaluate(x_test, y_test, verbose=2)
Explanation
1. Loading and Normalizing Data:
– The MNIST dataset is loaded and normalized to have values between 0 and 1.
2. Model Creation:
– A simple neural network model is created with one hidden layer and an output layer.
3. Compiling the Model:
– The model is compiled with an optimizer, loss function, and metrics.
4. Training with Batches:
– The model is trained using `model.fit()` with a specified batch size of 32.
– This means during each epoch, the model will process 32 samples at a time, update the parameters, and then proceed to the next batch until the entire dataset is covered.
5. Evaluation:
– After training, the model is evaluated on the test set.
Using batches allows for more efficient training and can lead to better performance and faster convergence compared to processing the entire dataset at once.
broadcasting
Broadcasting is a powerful mechanism that allows numpy and TensorFlow to perform arithmetic operations on arrays (tensors) of different shapes. It does this by virtually expanding the smaller array along the mismatched dimensions so that they have compatible shapes. Broadcasting makes many arithmetic operations easier to write and understand.
Broadcasting Rules
1. Dimensions Compatibility: Two dimensions are compatible when:
– They are equal, or
– One of them is 1.
2. Align from the Right: When comparing two arrays, start with the trailing (right-most) dimensions and work your way left. If the dimensions are not compatible, broadcasting cannot be performed.
How Broadcasting Works
When performing operations on two tensors, broadcasting happens as follows:
– Expand dimensions: If one tensor has fewer dimensions than the other, leading dimensions (on the left) of size 1 are added to the smaller tensor.
– Stretch dimensions: Dimensions of size 1 in the smaller tensor are virtually expanded to match the size of the corresponding dimension in the larger tensor.
Example 1: Adding a Scalar to a Tensor
import tensorflow as tf

# Tensor
tensor_a = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)

# Scalar
scalar_b = tf.constant(2, dtype=tf.float32)

# Broadcasting addition
result = tensor_a + scalar_b
print("Broadcasting Addition:\n", result.numpy())

Output:
Broadcasting Addition:
 [[3. 4. 5.]
 [6. 7. 8.]]
Explanation: The scalar `2` is broadcasted to match the shape of `tensor_a`, effectively creating an array `[[2, 2, 2], [2, 2, 2]]` before performing element-wise addition.
Example 2: Adding a Vector to a Matrix
# Matrix
tensor_a = tf.constant([[1, 2, 3], [4, 5, 6]], dtype=tf.float32)

# Vector
vector_b = tf.constant([1, 2, 3], dtype=tf.float32)

# Broadcasting addition
result = tensor_a + vector_b
print("Broadcasting Addition:\n", result.numpy())

Output:
Broadcasting Addition:
 [[2. 4. 6.]
 [5. 7. 9.]]
Explanation: The vector `[1, 2, 3]` is broadcasted to match the shape of `tensor_a`, effectively creating an array `[[1, 2, 3], [1, 2, 3]]` before performing element-wise addition.
Example 3: Broadcasting a Vector Across the Rows of a Matrix
# 2D Tensor (Matrix)
tensor_a = tf.constant([[1, 2], [3, 4], [5, 6]], dtype=tf.float32)

# 1D Tensor (Vector)
tensor_b = tf.constant([1, 2], dtype=tf.float32)

# Broadcasting addition
result = tensor_a + tensor_b
print("Broadcasting Addition:\n", result.numpy())

Output:
Broadcasting Addition:
 [[2. 4.]
 [4. 6.]
 [6. 8.]]
Explanation: The vector [1, 2] is broadcasted across the rows of the matrix tensor_a.
Summary
Broadcasting simplifies arithmetic operations on tensors of different shapes by automatically expanding smaller tensors along the mismatched dimensions. The main rules are that dimensions must either match or be of size 1, and TensorFlow (or numpy) will handle the rest. This allows for more concise and readable code when performing element-wise operations on tensors.
gradient-based optimization
Gradient-based optimization is a cornerstone technique in machine learning, particularly in training neural networks. It involves using the gradient of a loss function with respect to the model’s parameters to minimize the loss and improve the model’s performance. The most common gradient-based optimization method is gradient descent.
Key Concepts
1. Gradient: The gradient is a vector of partial derivatives that indicates the direction and rate of the fastest increase of a function. For a loss function [latex]L(\theta)[/latex], where [latex]\theta[/latex] represents the model parameters, the gradient [latex]\nabla L(\theta)[/latex] points in the direction of the steepest ascent. To minimize the loss, we move in the opposite direction of the gradient.
2. Loss Function: This is a measure of how well the model’s predictions match the actual data. Common loss functions include mean squared error for regression tasks and cross-entropy for classification tasks.
3. Learning Rate: A hyperparameter that controls the size of the steps taken to reach a minimum. A learning rate that is too high may overshoot the minimum, while a learning rate that is too low may take too long to converge.
Gradient Descent Variants
1. Batch Gradient Descent: Computes the gradient of the loss function with respect to the entire dataset. While this approach is accurate, it can be very slow and computationally expensive for large datasets.
2. Stochastic Gradient Descent (SGD): Computes the gradient of the loss function with respect to a single training example. This approach is faster and can handle large datasets, but the updates can be noisy and cause the loss to fluctuate.
3. Mini-Batch Gradient Descent: Computes the gradient of the loss function with respect to a small batch of training examples. This approach balances the trade-offs between batch and stochastic gradient descent, offering faster convergence and more stable updates.
Optimization Algorithms
1. Standard Gradient Descent:
[latex]
\theta := \theta - \eta \nabla L(\theta)
[/latex]
where [latex] \eta [/latex] is the learning rate.
2. Momentum:
[latex]
v := \beta v + \eta \nabla L(\theta)
[/latex]
[latex]
\theta := \theta - v
[/latex]
Momentum helps accelerate gradient descent by considering the previous gradients to smooth out the updates.
3. RMSprop:
[latex]
E[g^2]_t := \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
[/latex]
[latex]
\theta := \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
[/latex]
RMSprop adjusts the learning rate for each parameter, scaling it by the inverse square root of the running average of recent gradients.
4. Adam (Adaptive Moment Estimation):
[latex]
m_t := \beta_1 m_{t-1} + (1 - \beta_1) g_t
[/latex]
[latex]
v_t := \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
[/latex]
[latex]
\hat{m}_t := \frac{m_t}{1 - \beta_1^t}
[/latex]
[latex]
\hat{v}_t := \frac{v_t}{1 - \beta_2^t}
[/latex]
[latex]
\theta := \theta - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
[/latex]
Adam combines the ideas of momentum and RMSprop, maintaining an exponentially decaying average of past gradients and squared gradients.
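To make these update rules concrete, here is a toy sketch (my own example, framework-free) of plain gradient descent, momentum, and Adam minimizing f(θ) = θ²; the hyperparameter values are the conventional defaults:

import numpy as np

def grad(theta):
    # Gradient of f(theta) = theta^2
    return 2 * theta

eta = 0.1  # learning rate
theta_sgd = theta_mom = theta_adam = 5.0

# Momentum state
v = 0.0
beta = 0.9

# Adam state (m = first moment, s = second moment)
m, s = 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 101):
    # Plain gradient descent: theta := theta - eta * grad
    theta_sgd -= eta * grad(theta_sgd)

    # Momentum: v := beta*v + eta*grad ; theta := theta - v
    v = beta * v + eta * grad(theta_mom)
    theta_mom -= v

    # Adam: exponentially decaying moment estimates with bias correction
    g = grad(theta_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta_adam -= eta * m_hat / (np.sqrt(s_hat) + eps)

print(theta_sgd, theta_mom, theta_adam)  # all three end up near the minimum at 0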
Example in TensorFlow
Here is an example of using gradient descent with TensorFlow to optimize a simple linear regression model.
import tensorflow as tf
import numpy as np

# Generate some synthetic data (cast to float32 so the data matches the model's float32 variables)
np.random.seed(0)
X = np.random.rand(100, 1).astype(np.float32)
y = (2 * X + 1 + 0.1 * np.random.randn(100, 1)).astype(np.float32)

# Define the model
class LinearModel(tf.Module):
    def __init__(self):
        self.W = tf.Variable(np.random.randn(), dtype=tf.float32)
        self.b = tf.Variable(np.random.randn(), dtype=tf.float32)

    def __call__(self, x):
        return self.W * x + self.b

# Define the loss function
def loss_fn(model, x, y):
    y_pred = model(x)
    return tf.reduce_mean(tf.square(y - y_pred))

# Training function
def train(model, x, y, learning_rate):
    with tf.GradientTape() as tape:
        loss = loss_fn(model, x, y)
    gradients = tape.gradient(loss, [model.W, model.b])
    model.W.assign_sub(learning_rate * gradients[0])
    model.b.assign_sub(learning_rate * gradients[1])

# Initialize the model
model = LinearModel()

# Training loop
learning_rate = 0.1
epochs = 100
for epoch in range(epochs):
    train(model, X, y, learning_rate)
    if epoch % 10 == 0:
        current_loss = loss_fn(model, X, y)
        print(f"Epoch {epoch}: Loss: {current_loss.numpy()}")

print(f"Trained Weights: W = {model.W.numpy()}, b = {model.b.numpy()}")
Explanation
1. Synthetic Data: Generating some synthetic data for a simple linear regression problem.
2. Model Definition: A simple linear model with one weight and one bias.
3. Loss Function: Mean squared error between the predicted and actual values.
4. Training Function: Uses TensorFlow’s `GradientTape` to compute the gradients and update the model parameters using gradient descent.
5. Training Loop: Iteratively updates the model parameters over a number of epochs.
Gradient-based optimization techniques like these are essential for training neural networks and other machine learning models, enabling them to learn from data and improve their performance over time.
activation function
ReLU: https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning
overfitting
Overfitting is particularly likely to occur when your data is noisy, if it involves uncertainty, or if it includes rare features. (Chollet, 2021)
Overfitting occurs when a statistical model or machine learning algorithm captures noise or random fluctuations in the training data rather than the underlying data distribution. This typically happens when the model is too complex, having too many parameters relative to the number of observations or features in the training data. As a result, the model performs very well on the training data but poorly on unseen or test data.
Here are some common causes of overfitting:
1. Excessive Model Complexity: Using models that are too complex for the amount of data available, such as deep neural networks with many layers for a relatively small dataset.
2. Insufficient Training Data: Not having enough training data can lead to the model capturing noise specific to the training set rather than generalizable patterns.
3. Noise in Data: If the training data contains a lot of noise or irrelevant information, the model might learn these noise patterns instead of the true underlying trends.
4. High Variance in the Model: Models with high variance are very flexible and can fit the training data very closely, which leads to overfitting.
Signs of Overfitting
– High accuracy on training data but significantly lower accuracy on validation/test data.
– The model captures noise and outliers in the training data.
– High variance in model performance when using different subsets of the training data.
Techniques to Prevent Overfitting
– Simplifying the Model: Reducing the complexity of the model by using fewer parameters or features.
– Regularization: Techniques like L1 (Lasso) and L2 (Ridge) regularization add a penalty to the model complexity.
– Cross-Validation: Using k-fold cross-validation to ensure the model performs well on different subsets of the data.
– Pruning: In decision trees, pruning helps in removing parts of the tree that do not provide power to classify instances.
– Early Stopping: In iterative learning algorithms like gradient descent, training can be stopped early when performance on a validation set starts to degrade.
– Ensembling: Combining the predictions of multiple models (bagging, boosting, stacking) to reduce overfitting.
By applying these techniques, you can develop models that generalize better to new, unseen data.
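Several of these techniques are one-liners in Keras. A hedged sketch (the model is arbitrary, and x_train/y_train are assumed to be prepared as in the MNIST examples earlier) combining L2 weight regularization, dropout, and early stopping:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),  # L2 penalty on the weights
    layers.Dropout(0.5),                                      # randomly drop half the activations during training
    layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Stop training when validation loss stops improving for 3 epochs and keep the best weights
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)

# x_train / y_train assumed to be prepared as in the MNIST examples above
# model.fit(x_train, y_train, epochs=50, validation_split=0.2, callbacks=[early_stop])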
Generalization in machine learning refers to the ability of a model to perform well on new, previously unseen data, drawn from the same distribution as the data used to train the model. It is a measure of how well the concepts learned by the model can be applied to real-world scenarios outside the training set.
Key Aspects of Generalization:
1. Training vs. Testing Performance:
– Training Performance: How well the model performs on the training data.
– Testing Performance: How well the model performs on a separate set of data not seen during training (testing/validation data). Good generalization is indicated by similar performance on both training and testing data.
2. Overfitting:
– Occurs when a model learns the training data too well, capturing noise and details that do not generalize to new data.
– An overfitted model has low training error but high testing error.
3. Underfitting:
– Occurs when a model is too simple to capture the underlying patterns in the training data.
– An underfitted model has high training error and, consequently, high testing error.
4. Bias-Variance Tradeoff:
– Bias: Error due to overly simplistic models that do not capture the complexity of the data.
– Variance: Error due to models that are too complex and sensitive to the noise in the training data.
– The goal is to find a balance between bias and variance to achieve good generalization.
5. Regularization:
– Techniques like L1 and L2 regularization, dropout, and early stopping are used to prevent overfitting and improve generalization.
6. Cross-Validation:
– A method to evaluate the generalization ability of a model by dividing the data into multiple subsets and training/testing the model on these subsets.
Techniques to Improve Generalization:
1. Data Augmentation:
– Increasing the diversity of training data by creating modified versions of existing data.
2. Ensemble Methods:
– Combining multiple models to reduce variance and improve robustness.
3. Hyperparameter Tuning:
– Optimizing the parameters that control the learning process to enhance performance.
4. Pruning:
– Reducing the complexity of models by removing parts that are not contributing significantly to predictions.
Generalization is crucial because it determines the practical usability of a machine learning model. A model that generalizes well is more likely to perform reliably when applied to new data in real-world situations.
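As an example of the first technique, data augmentation for images is often done with random transforms composed into a preprocessing pipeline. The sketch below uses torchvision; the specific transforms, parameters, and dataset path are illustrative placeholders:

from torchvision import transforms

# Illustrative augmentation pipeline for image data (choices are placeholders).
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # random left-right flips
    transforms.RandomRotation(degrees=10),                 # small random rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random scaled crops
    transforms.ToTensor(),
])

# Such a pipeline is typically passed to a dataset, e.g. (hypothetical path):
# dataset = torchvision.datasets.ImageFolder("path/to/train", transform=train_transforms)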
training loop
Training Loop Steps
1. Initialization:
– The model’s weights (parameters) are initialized randomly.
– A dataset is prepared with input data (features) and corresponding correct output (labels).
2. Forward Pass:
– The input data is fed into the model.
– The model processes the input data through its layers and generates predictions.
3. Loss Calculation:
– The predictions are compared to the actual labels to calculate the loss.
– The loss is a measure of how far the model’s predictions are from the actual values.
4. Backward Pass (Backpropagation):
– The loss is propagated back through the network to calculate the gradients of the loss with respect to each weight.
– This involves using the chain rule to compute derivatives.
5. Weight Update (Optimization):
– The model’s weights are updated using an optimization algorithm (e.g., Stochastic Gradient Descent, Adam) to minimize the loss.
– This step uses the gradients calculated during the backward pass.
6. Iteration:
– Steps 2 to 5 are repeated for a fixed number of iterations (epochs) or until the model’s performance stops improving.
Example Training Loop in Python (using PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim

# Dummy dataset
inputs = torch.randn(100, 10)  # 100 samples, 10 features each
labels = torch.randn(100, 1)   # 100 samples, 1 output each

# Simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()

# Loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error loss
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()  # Set model to training mode

    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)

    # Backward pass and optimization
    optimizer.zero_grad()  # Clear the gradients
    loss.backward()        # Calculate the gradients
    optimizer.step()       # Update the weights

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training complete!")
Explanation of the Code
1. Dataset Preparation:
– `inputs` and `labels` are randomly generated tensors representing the features and labels.
2. Model Definition:
– `SimpleModel` is a simple neural network with one linear layer (`fc`).
3. Loss Function and Optimizer:
– `criterion` is the Mean Squared Error loss function.
– `optimizer` is Stochastic Gradient Descent (SGD) with a learning rate of 0.01.
4. Training Loop:
– The loop runs for `num_epochs` (100 iterations).
– In each iteration:
– The model performs a forward pass to make predictions (`outputs`).
– The loss is computed using the predictions and actual labels.
– The gradients are cleared using `optimizer.zero_grad()`.
– Backpropagation is performed using `loss.backward()` to compute gradients.
– The optimizer updates the model’s weights using `optimizer.step()`.
– Every 10 epochs, the current loss is printed.
This is a basic example to illustrate the core steps in a training loop. In real applications, the dataset would be split into training and validation sets, and more complex models and optimization techniques would be used.
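For reference, here is a hedged sketch of what such a train/validation split might look like, reusing the same kind of dummy data and a deliberately simple linear model (sizes and hyperparameters are arbitrary):

import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader, random_split

# Dummy data wrapped in a Dataset and split 80/20 into train and validation sets.
inputs = torch.randn(100, 10)
labels = torch.randn(100, 1)
dataset = TensorDataset(inputs, labels)
train_set, val_set = random_split(dataset, [80, 20])

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    # Evaluate on the held-out validation split after each epoch.
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)
    print(f"Epoch {epoch+1}: validation loss {val_loss:.4f}")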
Chain Rule
The chain rule is a fundamental concept in calculus used extensively in deep learning for backpropagation. It helps in calculating the gradient of a loss function with respect to each weight in a neural network. Let’s break it down with an example.
The chain rule is used to compute the derivative of a composite function. If you have a function [latex] z = f(g(x)) [/latex], the chain rule states that the derivative of z with respect to x is:
[latex]
\frac{dz}{dx} = \frac{dz}{dg} \cdot \frac{dg}{dx}
[/latex]
In other words, you first compute the derivative of the outer function f with respect to the inner function g, and then multiply it by the derivative of the inner function g with respect to x.
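As a quick sanity check, you can let PyTorch’s autograd confirm a hand-computed chain-rule derivative. The functions below (g(x) = x² and f(u) = 3u) are arbitrary choices for illustration:

import torch

# Chain rule check for z = f(g(x)) with g(x) = x**2 and f(u) = 3*u.
# Analytically: dz/dx = (dz/dg) * (dg/dx) = 3 * 2x = 6x.
x = torch.tensor(2.0, requires_grad=True)
g = x ** 2
z = 3 * g
z.backward()
print(x.grad)  # tensor(12.) because 6 * x = 6 * 2 = 12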
Chain Rule in Deep Learning
In a neural network, you typically have multiple layers, and you need to compute how changes in the weights of each layer affect the final loss. Here’s a simple neural network with one hidden layer:
1. Input Layer: [latex]x[/latex]
2. Hidden Layer: [latex] h = f(W_1 \cdot x + b_1) [/latex]
3. Output Layer: [latex] y = g(W_2 \cdot h + b_2) [/latex]
4. Loss Function: [latex] L = \text{loss}(y, \text{true\_label}) [/latex]
To update the weights [latex]W_1[/latex] and [latex]W_2[/latex], you need to compute the gradient of the loss [latex]L[/latex] with respect to these weights.
Example with Detailed Steps
Let’s take a specific example with simple functions and numbers.
1. Forward Pass:
– Suppose [latex]x = 1[/latex]
– [latex]W_1 = 2[/latex], [latex]b_1 = 0 [/latex], so [latex]h = f(W_1 \cdot x + b_1) = f(2 \cdot 1 + 0) = f(2) [/latex]
– Assume [latex] f(z) = z^2 [/latex], so [latex] h = 2^2 = 4 [/latex]
– [latex] W_2 = 3 [/latex], [latex]b_2 = 0[/latex], so [latex] y = g(W_2 \cdot h + b_2) = g(3 \cdot 4 + 0) = g(12) [/latex]
– Assume [latex]g(z) = z[/latex], so [latex]y = 12[/latex]
– Suppose the true label is [latex]10[/latex], and the loss function is Mean Squared Error: [latex] L = (y - \text{true\_label})^2 = (12 - 10)^2 = 4 [/latex]
2. Backward Pass (Using Chain Rule):
– Compute the gradient of the loss with respect to [latex] y [/latex]:
[latex] \frac{dL}{dy} = 2 \cdot (y - \text{true\_label}) = 2 \cdot (12 - 10) = 4 [/latex]
– Compute the gradient of [latex]y[/latex] with respect to [latex]W_2[/latex]:
[latex]\frac{dy}{dW_2} = h = 4 [/latex]
– Using the chain rule, the gradient of the loss with respect to [latex]W_2[/latex]:
[latex]\frac{dL}{dW_2} = \frac{dL}{dy} \cdot \frac{dy}{dW_2} = 4 \cdot 4 = 16 [/latex]
– Now, compute the gradient of [latex]y[/latex] with respect to [latex]h[/latex]:
[latex]\frac{dy}{dh} = W_2 = 3 [/latex]
– Compute the gradient of [latex]h[/latex] with respect to [latex]W_1[/latex] (recall [latex]h = (W_1 \cdot x + b_1)^2[/latex]):
[latex] \frac{dh}{dW_1} = \frac{d}{dW_1} (W_1 \cdot x + b_1)^2 = 2 \cdot (W_1 \cdot x + b_1) \cdot x = 2 \cdot 2 \cdot 1 = 4 [/latex]
– Using the chain rule, the gradient of the loss with respect to [latex]W_1[/latex]:
[latex]\frac{dL}{dW_1} = \frac{dL}{dy} \cdot \frac{dy}{dh} \cdot \frac{dh}{dW_1} = 4 \cdot 3 \cdot 4 = 48 [/latex]
Summary
– Forward pass: Calculate the output and the loss.
– Backward pass: Compute the gradients using the chain rule.
– For each layer, calculate the gradient of the loss with respect to its weights by multiplying the gradients of the subsequent layers (starting from the output layer).
Example in Code (PyTorch)
Here’s how you would implement this in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model with one hidden layer
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(1, 1)  # W1
        self.fc2 = nn.Linear(1, 1)  # W2

    def forward(self, x):
        h = self.fc1(x)
        h = h ** 2  # Squaring function as activation
        y = self.fc2(h)
        return y

model = SimpleModel()

# Input and target
inputs = torch.tensor([[1.0]])
labels = torch.tensor([[10.0]])

# Loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Forward pass
outputs = model(inputs)
loss = criterion(outputs, labels)

# Backward pass and optimization
optimizer.zero_grad()
loss.backward()

# Gradients
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f'Gradient of {name} is {param.grad}')

# Update weights
optimizer.step()
This code defines a simple model, performs a forward pass to calculate the output and loss, and then uses the backward pass to compute gradients using the chain rule. The gradients are printed to show the calculated values before the weights are updated.
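Note that because nn.Linear initializes its weights randomly, the gradients printed above will not match the hand-worked numbers. As a rough sketch, continuing from the snippet above, the parameters can be overwritten with the values used in the worked example (W1 = 2, W2 = 3, zero biases) so that backpropagation reproduces the gradients 16 and 48:

# Continuing from the code above: replace the random initialization with the
# values from the worked example, then redo the forward and backward passes.
with torch.no_grad():
    model.fc1.weight.fill_(2.0)  # W1 = 2
    model.fc1.bias.fill_(0.0)    # b1 = 0
    model.fc2.weight.fill_(3.0)  # W2 = 3
    model.fc2.bias.fill_(0.0)    # b2 = 0

outputs = model(inputs)            # y = 12
loss = criterion(outputs, labels)  # L = (12 - 10)^2 = 4

optimizer.zero_grad()
loss.backward()
print(model.fc2.weight.grad)  # tensor([[16.]]) -- matches dL/dW2 above
print(model.fc1.weight.grad)  # tensor([[48.]]) -- matches dL/dW1 above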
Stochastic Gradient Descent
SGD stands for Stochastic Gradient Descent. It’s a fundamental optimization algorithm used in training machine learning models, particularly in deep learning.
Here’s a simple breakdown:
1. Gradient Descent: Imagine you’re on a mountain and want to get to the lowest point, which represents the minimum of a function (in deep learning, this function represents the error or loss). Gradient Descent is like taking small steps downhill. At each step, you look at the slope (gradient) of the hill at your current position and take a step in the direction that goes downhill.
2. Stochastic: Now, instead of computing the slope over the entire dataset at once, we compute it on a randomly chosen sample (or small mini-batch) and take a step based on that. This randomness makes each step much cheaper and can sometimes help escape local minima (points that are low but not the lowest).
3. Training a Neural Network: In deep learning, we have a neural network with lots of parameters (weights and biases). The goal is to find the best values for these parameters that minimize the error between the predicted outputs and the actual outputs. SGD helps us adjust these parameters by computing the gradient of the error with respect to each parameter and updating them in the direction that decreases the error.
4. Example: Let’s say you’re trying to teach a neural network to recognize handwritten digits. You show it a bunch of images of digits along with their labels (e.g., an image of a “3” labeled as “3”). Initially, the neural network makes random guesses about what the digits are. You use SGD to adjust the parameters (weights and biases) of the network based on the errors it makes. For each image, you compute how much the network’s guess differs from the actual label, and you adjust the parameters a little bit to reduce that difference. You do this for many images (possibly going through the dataset multiple times), gradually improving the network’s ability to correctly recognize digits.
So, in essence, SGD is like a guided downhill walk where you take small steps based on random observations to reach the lowest point (minimum error) efficiently.
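To make the idea concrete, here is a minimal from-scratch sketch of SGD on a made-up 1-D problem (fitting y ≈ w·x): each update looks at one randomly chosen sample, computes the gradient of that sample’s squared error, and steps downhill. The data, learning rate, and step count are arbitrary choices for illustration:

import numpy as np

# From-scratch SGD on a 1-D linear fit y = w * x, one random sample per update.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)   # true slope is 3

w = 0.0    # initial guess
lr = 0.1   # learning rate (step size)

for step in range(1000):
    i = rng.integers(len(x))            # pick one random sample (the "stochastic" part)
    pred = w * x[i]
    grad = 2 * (pred - y[i]) * x[i]     # d/dw of that sample's squared error
    w -= lr * grad                      # step downhill

print(w)  # should end up close to 3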
hyperparameter tuning
There are many different ways to potentially improve a neural network. Some of the most common include: increasing the number of layers (making the network deeper), increasing the number of hidden units (making the network wider), and changing the learning rate. Because these values are all human-changeable, they’re referred to as hyperparameters, and the practice of trying to find the best hyperparameters is referred to as hyperparameter tuning.
from https://dev.mrdbourke.com/tensorflow-deep-learning/01_neural_network_regression_in_tensorflow/
An example:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam', init='uniform'):
    model = Sequential()
    model.add(Dense(12, input_dim=8, kernel_initializer=init, activation='relu'))
    model.add(Dense(8, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Fix random seed for reproducibility
seed = 7
np.random.seed(seed)

# Load dataset
# Assuming X and y are your features and labels
# For example:
# X = np.array([...])
# y = np.array([...])

# Create model
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define the grid search parameters
param_grid = {
    'batch_size': [10, 20, 40],
    'epochs': [50, 100],
    'optimizer': ['SGD', 'Adam'],
    'init': ['uniform', 'normal']
}

# Create GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)

# Fit the model
grid_result = grid.fit(X, y)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, std, param))

Explanation:
1. Model Creation Function: `create_model` defines the structure and compilation of the neural network.
2. KerasClassifier: Wraps the Keras model so it can be used by scikit-learn’s `GridSearchCV`. (Note: `keras.wrappers.scikit_learn` has been removed from recent Keras/TensorFlow releases; the SciKeras package provides an equivalent wrapper.)
3. Grid Search Parameters: Defines the hyperparameters and their possible values.
4. GridSearchCV: Performs the grid search over the specified hyperparameters.
5. Fitting the Model: Trains the model on the provided dataset and searches for the best hyperparameters.
6. Results: Prints out the best score and the corresponding hyperparameters, as well as the mean and standard deviation of the scores for each combination.
By using this approach, you can systematically explore a range of hyperparameters to find the combination that yields the best performance for your neural network.
Pytorch
Tensor
Tensors are a specialized data structure that are very similar to arrays and matrices. Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and NumPy arrays can often share the same underlying memory, eliminating the need to copy data (see Bridge with NumPy). Tensors are also optimized for automatic differentiation (we’ll see more about that later in the Autograd section). If you’re familiar with ndarrays, you’ll be right at home with the Tensor API.

import torch
import numpy as np

data = [[1, 2], [3, 4]]
x_data = torch.tensor(data)

np_array = np.array(data)
x_np = torch.from_numpy(np_array)

x_ones = torch.ones_like(x_data)  # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float)  # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

Output:
Ones Tensor:
 tensor([[1, 1],
         [1, 1]])

Random Tensor:
 tensor([[0.4223, 0.1719],
         [0.3184, 0.2631]])
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.transforms import ToTensor

# training_data and test_data are assumed to be defined elsewhere,
# e.g., torchvision datasets such as datasets.FashionMNIST(..., transform=ToTensor())
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
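Assuming train_dataloader is defined as above, iterating over it yields mini-batches of (inputs, labels); a quick sketch:

# Peek at the first batch produced by the DataLoader (shapes depend on the dataset).
for batch, (X, y) in enumerate(train_dataloader):
    print(f"Batch {batch}: X shape {X.shape}, y shape {y.shape}")
    break  # only show the first batch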