Transfer Learning

Simple Explanation: Transfer learning is a machine learning method where knowledge gained while solving one problem is reused to solve a different but related problem. Think of it as applying what you know about riding a bicycle to learning to ride a motorcycle. You already understand balancing and steering, so you only need to learn the throttle, gears, and higher speeds, which shortens the learning process.

Real-World Example: A common real-world example is image recognition. Suppose a model has been trained to recognize various types of animals in pictures. This model can use much of its learned knowledge (like identifying shapes and textures) to help in a new but related task, such as identifying specific breeds of dogs. The initial learning from the general task of recognizing animals simplifies the process of learning the more specific task of recognizing dog breeds.

Mathematical Foundations: Transfer learning typically modifies the final layers of a neural network. Here’s a simplified overview of the math behind it:

Model Initialization: Start with a pre-trained model, which has weights $W$ learned from a previous task.
Adaptation: Modify or replace the final layers of the model to make it suitable for the new task. This often involves changing the output layer to match the number of new categories or labels in the new task.
Freezing Layers: Often, the weights of the initial layers ( $W_{initial}$ ) are kept frozen, meaning they are not updated during the new training because they capture universal features like edges and textures that are useful across tasks.
Re-training: Train the modified layers ( $W_{new}$ ) on the new task data. Only these weights are updated to learn task-specific features.

The mathematical representation typically involves minimizing a loss function $L$ with respect to the new weights, while keeping the initial weights fixed. The training process adjusts $W_{new}$ to reduce errors in predictions for the new task.
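
Written out, the objective described above might look as follows. This is only a sketch: the model function $f$, the per-example loss $\ell$, the training pairs $(x_i, y_i)$, the sample count $N$, and the learning rate $\eta$ are notation assumed here, not taken from the text above.

$$
\begin{aligned}
W_{new}^{*} &= \arg\min_{W_{new}} L(W_{new}), \qquad
L(W_{new}) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i;\, W_{initial},\, W_{new}),\; y_i\big), \\
W_{new} &\leftarrow W_{new} - \eta \, \nabla_{W_{new}} L(W_{new}), \qquad
W_{initial} \ \text{held fixed}.
\end{aligned}
$$

The second line is a plain gradient-descent update. Optimizers such as Adam change the update rule, but they still update only $W_{new}$.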

Here’s an example of how you could implement transfer learning with a pre-trained model using Python and TensorFlow/Keras, adapting a model trained on ImageNet to a new task of classifying different types of flowers:

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# Load the pre-trained VGG16 model
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the layers of the base model
for layer in base_model.layers:
    layer.trainable = False

# Add new layers for the new classification task
x = Flatten()(base_model.output)
x = Dense(1024, activation='relu')(x)
predictions = Dense(5, activation='softmax')(x) # Assuming 5 classes of flowers

# Create the new model
model = Model(inputs=base_model.input, outputs=predictions)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Now you can train the model on your new data

This code snippet starts with a VGG16 model pre-trained on ImageNet. It freezes the convolutional base to preserve the learned features that are generally applicable to images (like detecting edges and textures), then adds a few trainable layers on top that can learn features specific to the new dataset of flower images. The model is then compiled and ready to be trained on the new data.
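
To see the freeze-then-train idea at a very small scale, here is a minimal NumPy sketch. It is not the Keras workflow above. The "pre-trained" weights are just a fixed random projection, and the data, layer sizes, and learning rate are all invented for illustration. It shows only that gradient descent updates the new head (`W_new`) while the frozen weights (`W_initial`) stay untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained weights: a fixed random projection (hypothetical).
W_initial = rng.normal(scale=0.5, size=(4, 8))
W_initial_before = W_initial.copy()

# Trainable head for a 2-class task, starting from zeros.
W_new = np.zeros((8, 2))

# Toy data: the class is determined by the sign of the first input feature.
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[y]  # one-hot labels

def forward(X, W_initial, W_new):
    features = np.maximum(X @ W_initial, 0.0)     # frozen layer + ReLU
    logits = features @ W_new                     # trainable head
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return features, probs

def cross_entropy(probs, Y):
    return float(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))

_, probs = forward(X, W_initial, W_new)
loss_before = cross_entropy(probs, Y)

learning_rate = 0.1
for _ in range(200):
    features, probs = forward(X, W_initial, W_new)
    grad_W_new = features.T @ (probs - Y) / len(X)  # gradient w.r.t. the head only
    W_new -= learning_rate * grad_W_new             # W_initial is never updated

_, probs = forward(X, W_initial, W_new)
loss_after = cross_entropy(probs, Y)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print("frozen weights unchanged:", np.array_equal(W_initial, W_initial_before))
```

In Keras, setting `layer.trainable = False` plays the role of leaving `W_initial` out of the update step.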

Transfer learning, used in machine learning, is the reuse of a pre-trained model on a new problem. In transfer learning, a machine exploits the knowledge gained from a previous task to improve generalization about another. For example, for image classification, knowledge gained while learning to recognize cars could be applied when trying to recognize trucks. This topic is related to the psychological literature on transfer of learning, although practical ties between the two fields are limited. Reusing/transferring information from previously learned tasks to new tasks has the potential to significantly improve learning efficiency.

Transfer learning is also common in natural language processing problems that use text as input or output. These problems often use a word embedding: a mapping of words to a high-dimensional continuous vector space in which words with similar meanings have similar vector representations. Efficient algorithms exist to learn these distributed word representations, and research organizations commonly release models pre-trained on very large corpora of text under permissive licenses.
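
The phrase "similar meaning, similar vector" is usually measured with cosine similarity. The sketch below uses three tiny hand-made vectors that are invented purely for illustration. Real pre-trained embeddings are learned from text and typically have hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])

print(f"king vs queen: {sim_related:.3f}")
print(f"king vs apple: {sim_unrelated:.3f}")
```

With these toy values the related pair scores much higher than the unrelated pair, which is the property a downstream model exploits when it starts from pre-trained embeddings.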

In his book on Deep Learning for Natural Language Processing, Yoav Goldberg cautions:

… one can download pre-trained word vectors that were trained on very large quantities of text […] differences in training regimes and underlying corpora have a strong influence on the resulting representations, and that the available pre-trained representations may not be the best choice for [your] particular use case.

Fine-tuning a Neural Network explained: This video introduces the concept of fine-tuning, a form of transfer learning in which a pre-trained model is adjusted to perform a new but related task. It uses the VGG16 model with Keras to illustrate how fine-tuning is implemented in code.

Pytorch Transfer Learning and Fine Tuning Tutorial: This video tutorial demonstrates fine-tuning and transfer learning in PyTorch. It is helpful for understanding the technical steps involved in adapting a pre-trained model to a new task.

Step-by-Step Example: Image Classification

We’ll use a common scenario: fine-tuning a pre-trained model to classify a new set of images. For this example, let’s assume you want to classify different types of flowers.

Step 1: Choose a Pre-trained Model

TensorFlow and Keras offer several pre-trained models like VGG16, ResNet, Inception, etc. These models are trained on large datasets like ImageNet and have learned rich feature representations for a wide range of images. For simplicity, we’ll use VGG16.

Step 2: Load the Pre-trained Model

First, import the necessary libraries and load the pre-trained VGG16 model without its top layers (include_top=False drops the fully connected classifier, which is tailored to the original 1000-class ImageNet task).

from tensorflow.keras.applications import VGG16

# Load the model without the top layer
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

Step 3: Freeze the Convolutional Base

Since the early layers of a pre-trained model have learned to detect universal features like edges and textures, we freeze them so that these learned features are not updated during the first phase of training.

for layer in base_model.layers:
    layer.trainable = False

Step 4: Add Custom Layers

Now, add the layers that you will train on your dataset. We’ll add a Flatten layer to convert the 3D feature maps produced by the convolutional base into a 1D vector, followed by Dense layers for classification.

from tensorflow.keras import layers, models

# Create the model
model = models.Sequential()
model.add(base_model)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dropout(0.5)) # Dropout for regularization
model.add(layers.Dense(5, activation='softmax')) # Assuming 5 classes
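
To get a feel for how much of this model is frozen and how much is trained, the parameter counts can be worked out by hand. The sketch below is plain arithmetic, not a Keras call. It assumes the standard VGG16 layout (3x3 convolutions with biases, five 2x2 pooling stages) and the 256-unit head defined above. The helper names are made up for this example.

```python
# VGG16 convolutional blocks as (number of conv layers, output channels).
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def conv_base_params(blocks, in_channels=3, kernel=3):
    """Weights plus biases of every conv layer in the base."""
    total = 0
    for num_convs, out_channels in blocks:
        for _ in range(num_convs):
            total += kernel * kernel * in_channels * out_channels + out_channels
            in_channels = out_channels
    return total

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

frozen = conv_base_params(VGG16_BLOCKS)

# Five 2x2 poolings reduce 224x224 to 7x7, with 512 channels after block 5.
flattened = (224 // 2 ** len(VGG16_BLOCKS)) ** 2 * VGG16_BLOCKS[-1][1]

# Flatten and Dropout add no parameters; only the two Dense layers do.
trainable = dense_params(flattened, 256) + dense_params(256, 5)

print(f"frozen (conv base):   {frozen:,}")
print(f"flattened features:   {flattened:,}")
print(f"trainable (new head): {trainable:,}")
```

Calling `model.summary()` on the Keras model should report comparable totals under "Trainable params" and "Non-trainable params".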

Step 5: Compile the Model

Set up the model for training. Specify the optimizer, loss function, and metrics to monitor.

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Step 6: Prepare Your Data

You’ll need to prepare your dataset for training, which involves resizing images to the expected input size, applying transformations for data augmentation, and splitting the data into training and validation sets.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create a data generator for augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

test_datagen = ImageDataGenerator(rescale=1./255)  # No augmentation for validation data

# Flow training images in batches of 20 using train_datagen generator
train_generator = train_datagen.flow_from_directory(
    'path_to_train_data',
    target_size=(224, 224),  # Resize images
    batch_size=20,
    class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
    'path_to_validation_data',
    target_size=(224, 224),
    batch_size=20,
    class_mode='categorical')
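
The generators above assume the images are already split into a training directory and a validation directory, each with one sub-folder per class. If you start from a single pool of images, a split along the following lines could be done first. This is a minimal sketch: the function name, the file names, and the 20% validation fraction are assumptions for illustration, and copying the files into the two directories is left out.

```python
import random

def split_by_class(files_by_class, val_fraction=0.2, seed=42):
    """Split each class's files into train and validation lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = {}, {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)
        rng.shuffle(shuffled)
        n_val = max(1, int(len(shuffled) * val_fraction))
        val[label] = shuffled[:n_val]
        train[label] = shuffled[n_val:]
    return train, val

# Hypothetical file names for two flower classes.
files_by_class = {
    "daisy": [f"daisy_{i:03d}.jpg" for i in range(10)],
    "tulip": [f"tulip_{i:03d}.jpg" for i in range(15)],
}

train, val = split_by_class(files_by_class)
for label in files_by_class:
    print(label, len(train[label]), "train /", len(val[label]), "validation")
```

Splitting per class keeps every class represented in both sets, which matters when some classes have few images.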

Step 7: Train the Model

Now, train the model on your new dataset. Since the convolutional base is frozen, only the weights in the Dense layers you added will be updated.

history = model.fit(
    train_generator,
    steps_per_epoch=100,  # Total number of batches in the generator to complete one epoch
    epochs=10,  # Number of epochs to train the model
    validation_data=validation_generator,
    validation_steps=50)  # Total number of batches in the validation generator
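
The values 100 and 50 above only make sense for a particular dataset size: with a batch size of 20 they correspond to 2,000 training images and 1,000 validation images. A small helper like the one below (a sketch; the function name and sample counts are hypothetical) derives the step counts from your own data instead of hard-coding them.

```python
import math

def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

batch_size = 20
# Hypothetical dataset sizes; substitute your own counts.
num_train, num_val = 2000, 1000

print(steps_for(num_train, batch_size))  # 100
print(steps_for(num_val, batch_size))    # 50
print(steps_for(1990, batch_size))       # 100 (the last batch is smaller)
```

With generators created by flow_from_directory, the sample count should be available as `train_generator.samples`, so the call might read `steps_for(train_generator.samples, batch_size)`.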

Step 8: Fine-Tuning (Optional)

Once the top layers are well-trained, you can unfreeze some of the upper layers in the convolutional base to allow fine-tuning of more specific features. This is typically done after the initial training with a very low learning rate to avoid destroying the features.

import tensorflow as tf

# Unfreeze the last convolutional block; the first 15 layers stay frozen
base_model.trainable = True
for layer in base_model.layers[:15]:
    layer.trainable = False

# Recompile with a low learning rate so the trainable changes take effect
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Continue training
model.fit(train_generator, epochs=5, validation_data=validation_generator)
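
It is not obvious from the code which layers the slice `[:15]` covers. The sketch below rebuilds the layer order of Keras VGG16 without its top, using the standard `blockN_convM` / `blockN_pool` naming, and applies the same slice. The input layer's name varies between versions, so "input" here is a placeholder.

```python
# Layer order of Keras VGG16 with include_top=False.
# "input" is a placeholder name for the input layer.
vgg16_layers = ["input"]
for block, num_convs in enumerate([2, 2, 3, 3, 3], start=1):
    vgg16_layers += [f"block{block}_conv{i}" for i in range(1, num_convs + 1)]
    vgg16_layers.append(f"block{block}_pool")

frozen = vgg16_layers[:15]      # stays frozen, as in the loop above
fine_tuned = vgg16_layers[15:]  # becomes trainable

print("total layers:", len(vgg16_layers))
print("last frozen layer:", frozen[-1])
print("fine-tuned layers:", fine_tuned)
```

So the slice keeps blocks 1 through 4 frozen and fine-tunes only block 5. You can confirm this on the real model by printing `layer.name` and `layer.trainable` for each entry in `base_model.layers`.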

This step-by-step guide shows how to implement transfer learning: a pre-trained network supplies general image features, so you train only a small task-specific head (and optionally fine-tune the top convolutional block) instead of developing a model from scratch. This saves training time and reuses powerful, pre-existing neural networks to improve accuracy on specialized tasks.

Let’s go through the steps of implementing transfer learning in Natural Language Processing (NLP) using a Python example. We’ll use a pre-trained model from Hugging Face’s transformers library, specifically BERT (Bidirectional Encoder Representations from Transformers), to fine-tune a text classification task. This example will focus on sentiment analysis, where the model will learn to classify text as positive or negative.

Step 1: Select a Pre-trained Model

BERT has been pre-trained on a large corpus of text and has learned to understand the nuances of the English language (syntax, context, etc.). This model can be adapted to a wide range of NLP tasks.

Step 2: Install Required Libraries

Ensure you have the necessary libraries installed:

pip install transformers torch

Step 3: Load the Pre-trained Model and Tokenizer

Import the model and tokenizer. The tokenizer converts text into tokens that can be processed by the model.

from transformers import BertModel, BertTokenizer, BertForSequenceClassification
import torch

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2) # num_labels=2 for binary classification

Step 4: Prepare Your Dataset

Prepare your dataset for training. You’ll need to tokenize your text data, align it to the format expected by BERT, and create data loaders.

from torch.utils.data import DataLoader, Dataset

class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = str(self.texts[item])
        label = self.labels[item]

        # Tokenize, add [CLS]/[SEP], then pad or truncate to max_len
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,  # without this, longer texts are not cut to max_len
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Example usage
dataset = SentimentDataset(texts=["I love this!", "I hate that!"], labels=[1, 0], tokenizer=tokenizer, max_len=512)
loader = DataLoader(dataset, batch_size=2)
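
To make "the format expected by BERT" concrete, the sketch below imitates what the tokenizer returns, without loading any library. It is a toy: real BERT uses a WordPiece vocabulary of roughly 30,000 entries, whereas the vocabulary here is invented and the text is simply split on whitespace. The special-token IDs (101 for [CLS], 102 for [SEP], 0 for [PAD], 100 for [UNK]) are the ones I believe bert-base-uncased uses, but treat them as assumptions.

```python
CLS_ID, SEP_ID, PAD_ID, UNK_ID = 101, 102, 0, 100

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"i": 1000, "love": 1001, "hate": 1002, "this": 1003, "that": 1004}

def encode(text, max_len):
    """Toy stand-in for a BERT tokenizer: ids, special tokens, padding, mask."""
    tokens = text.lower().replace("!", "").split()
    ids = [toy_vocab.get(tok, UNK_ID) for tok in tokens]
    ids = [CLS_ID] + ids[: max_len - 2] + [SEP_ID]  # truncate, add special tokens
    attention_mask = [1] * len(ids)                 # 1 = real token
    padding = max_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding
    }

encoded = encode("I love this!", max_len=8)
print(encoded["input_ids"])       # [101, 1000, 1001, 1003, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so padding does not influence the prediction.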

Step 5: Define Training Loop

Implement the training loop where the model is fine-tuned on your dataset.



from torch.optim import AdamW  # the AdamW class in transformers is deprecated

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()  # enable dropout during fine-tuning
for epoch in range(3):  # run for more epochs depending on your dataset size and complexity
    for batch in loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Passing labels makes the model return the classification loss
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Loss of the last batch in this epoch
    print(f"Loss: {loss.item()}")

Step 6: Evaluate the Model

After training, evaluate the model's performance on a validation or test set to see how well it performs on unseen data.

# This would involve running the model on a validation set and comparing the predicted labels with the actual labels.
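
As a sketch of that comparison step, the helper below computes accuracy, precision, and recall from two lists of labels. It is deliberately library-free. The function name and the example labels are hypothetical, and in practice the predicted labels would come from running the fine-tuned model over a held-out set.

```python
def evaluate(predicted, actual, positive_label=1):
    """Compare predicted and actual labels for a binary task."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive_label and a == positive_label for p, a in pairs)
    fp = sum(p == positive_label and a != positive_label for p, a in pairs)
    fn = sum(p != positive_label and a == positive_label for p, a in pairs)
    correct = sum(p == a for p, a in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels for six validation examples (1 = positive, 0 = negative).
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

metrics = evaluate(predicted, actual)
print(metrics)
```

Accuracy alone can mislead when one class dominates the data, which is why precision and recall are reported alongside it.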

Step 7: Use the Model for Inference

Use the fine-tuned model to make predictions on new text.

def predict(text, tokenizer, model):
    model.eval()
    inputs = tokenizer.encode_plus(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs.to(device))
    prediction = torch.argmax(outputs.logits, dim=-1)
    return prediction.item()

# Test the model
predict("This product is great!", tokenizer, model)
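
The function above returns only a class index (0 or 1). If you also want a readable label and a confidence score, the logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits. The label mapping follows the example dataset, where "I love this!" was labeled 1 and "I hate that!" was labeled 0.

```python
import math

LABELS = {0: "negative", 1: "positive"}  # matches labels=[1, 0] in the example dataset

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, as might come from outputs.logits[0].tolist().
logits = [-1.2, 2.3]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(LABELS[best], round(probs[best], 3))
```

The probabilities are only as trustworthy as the fine-tuned model itself, so a high score is not a guarantee of a correct prediction.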

Conclusion

By fine-tuning BERT on your specific dataset, you leverage the rich representations BERT has learned during its initial training, allowing for more accurate and nuanced predictions in your specific NLP task. This process can be adapted to various other NLP tasks like text summarization, question answering, etc.
