Introduction
TinyBERT is a compact version of BERT (Bidirectional Encoder Representations from Transformers), produced via knowledge distillation, that aims to deliver comparable performance at a fraction of the model size. In this tutorial, we’ll demonstrate how to use TinyBERT_General_4L_312D (4 transformer layers, hidden size 312) for text classification.
Prerequisites
- Python 3.6 or higher
- PyTorch
- Transformers library by Hugging Face
- Datasets library by Hugging Face (to load the training and test data)
Step 1: Install Required Libraries
First, let’s install the necessary libraries:
pip install torch transformers datasets
Step 2: Load TinyBERT Model and Tokenizer
We need to load the TinyBERT model and its corresponding tokenizer from the Hugging Face Transformers library.
from transformers import BertTokenizer, BertForSequenceClassification
# Load the TinyBERT tokenizer and model with a 2-label classification head
tokenizer = BertTokenizer.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')
model = BertForSequenceClassification.from_pretrained('huawei-noah/TinyBERT_General_4L_312D', num_labels=2)
# Note: the classification head is newly initialized; it gets trained in Step 6
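As a quick, optional sanity check, you can run the tokenizer on a short sentence and inspect the output (the sentence below is just an example):

# Tokenize a sample sentence and look at the first few token IDs
sample = tokenizer("TinyBERT is a compact version of BERT.", truncation=True, max_length=128)
print(sample['input_ids'][:10])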
Step 3: Prepare the Dataset
We’ll use the datasets library to load and preprocess a dataset. For this example, we’ll use the IMDB dataset for binary sentiment classification.
from datasets import load_dataset
# Load the IMDB dataset
dataset = load_dataset('imdb')
# Use the predefined train and test splits
train_dataset = dataset['train']
test_dataset = dataset['test']
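If you want to see what a raw example looks like, each record has a 'text' field and an integer 'label' (0 = negative, 1 = positive):

# Peek at one raw training example
print(train_dataset[0]['label'])
print(train_dataset[0]['text'][:200])  # first 200 characters of the review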
Step 4: Tokenize the Dataset
We need to tokenize the text data so that it can be fed into the TinyBERT model.
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)
# Tokenize the dataset
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
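After mapping, the tokenized fields are added alongside the original columns. You can confirm this by printing the column names (the exact list may vary slightly with the tokenizer):

# Inspect the columns after tokenization
print(train_dataset.column_names)
# e.g. ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']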
Step 5: Prepare Data Loaders
PyTorch requires data to be loaded in batches. We’ll use the DataLoader class for this purpose.
import torch
from torch.utils.data import DataLoader
# Define a collate function that stacks the tokenized fields into tensors
data_collator = lambda data: {
    'input_ids': torch.tensor([f['input_ids'] for f in data], dtype=torch.long),
    'attention_mask': torch.tensor([f['attention_mask'] for f in data], dtype=torch.long),
    'labels': torch.tensor([f['label'] for f in data], dtype=torch.long)
}
# Create data loaders
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=data_collator)
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=data_collator)
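To make sure the collator produces what the model expects, you can pull a single batch and check the tensor shapes:

# Grab one batch and verify the shapes
batch = next(iter(train_dataloader))
print(batch['input_ids'].shape)       # torch.Size([16, 128])
print(batch['attention_mask'].shape)  # torch.Size([16, 128])
print(batch['labels'].shape)          # torch.Size([16])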
Step 6: Train the Model
Now, we’ll define a training loop to fine-tune the TinyBERT model on our dataset.
import torch
from torch.optim import AdamW
from tqdm import tqdm
# Set device (GPU or CPU)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
# Define optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)
# Training loop
model.train()
for epoch in range(3):  # Train for 3 epochs
    loop = tqdm(train_dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}
        # Forward pass (passing 'labels' makes the model return the loss)
        outputs = model(**batch)
        loss = outputs.loss
        # Backward pass
        loss.backward()
        optimizer.step()
        # Update progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
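The loop above uses a constant learning rate. If you want a warmup-and-decay schedule instead, the Transformers library provides get_linear_schedule_with_warmup; here is a minimal sketch of how it could be wired in (the warmup step count of 0 is just an example value):

from transformers import get_linear_schedule_with_warmup

num_training_steps = len(train_dataloader) * 3  # total steps for 3 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
# Inside the training loop, call scheduler.step() right after optimizer.step()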
Step 7: Evaluate the Model
After training, we need to evaluate the model’s performance on the test dataset.
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_dataloader:
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}
        # Forward pass
        outputs = model(**batch)
        logits = outputs.logits
        # Calculate accuracy
        predictions = torch.argmax(logits, dim=-1)
        correct += (predictions == batch['labels']).sum().item()
        total += len(batch['labels'])
accuracy = correct / total
print(f'Test Accuracy: {accuracy:.4f}')
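With training and evaluation done, you can also try the fine-tuned model on a new sentence. A minimal sketch (the input text is made up; in the IMDB dataset, label 1 corresponds to positive sentiment):

# Classify a new piece of text with the fine-tuned model
text = "This movie was surprisingly good!"  # example input
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()
print('positive' if prediction == 1 else 'negative')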
Conclusion
In this tutorial, we’ve shown how to use TinyBERT_General_4L_312D for text classification. We’ve gone through loading the model and tokenizer, preparing the dataset, training the model, and evaluating its performance. TinyBERT offers a lightweight yet effective alternative to the original BERT model, making it suitable for deployment in resource-constrained environments.
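If you plan to deploy the model, you can save the fine-tuned weights and tokenizer with save_pretrained and reload them later with from_pretrained (the output directory name below is just an example):

# Save the fine-tuned model and tokenizer for later use
model.save_pretrained('./tinybert-imdb')
tokenizer.save_pretrained('./tinybert-imdb')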