Introduction

TinyBERT is a compact version of BERT (Bidirectional Encoder Representations from Transformers) produced via knowledge distillation, offering performance close to the original at a fraction of the size. The TinyBERT_General_4L_312D checkpoint has 4 transformer layers and a hidden size of 312 (roughly 14.5M parameters, versus about 110M for BERT-base). In this tutorial, we’ll demonstrate how to fine-tune TinyBERT_General_4L_312D for text classification.

Prerequisites

  • Python 3.6 or higher
  • PyTorch
  • Transformers library by Hugging Face
  • Datasets for training and testing

Step 1: Install Required Libraries

First, let’s install the necessary libraries:

pip install torch transformers datasets

Step 2: Load TinyBERT Model and Tokenizer

We need to load the TinyBERT model and its corresponding tokenizer from the Hugging Face Transformers library.

from transformers import BertTokenizer, BertForSequenceClassification

# Load the TinyBERT tokenizer and model. The general-distillation checkpoint has no
# task-specific head, so a new two-class classification head is initialized here and
# fine-tuned in Step 6.
tokenizer = BertTokenizer.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')
model = BertForSequenceClassification.from_pretrained('huawei-noah/TinyBERT_General_4L_312D', num_labels=2)
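If you want to confirm that everything loads correctly, you can optionally run a quick forward pass on a single sentence. This is only a sanity check; the example sentence is arbitrary, and the logits are not meaningful yet because the classification head is still untrained.

import torch

# Optional sanity check: encode one sentence and run a forward pass.
sample = tokenizer('TinyBERT is small but fast.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**sample)
print(outputs.logits.shape)  # torch.Size([1, 2]) -- one example, two labels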

Step 3: Prepare the Dataset

We’ll use the datasets library to load and preprocess a dataset. For this example, we’ll use the IMDB dataset for binary sentiment classification.

from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset('imdb')

# Use the provided train and test splits
train_dataset = dataset['train']
test_dataset = dataset['test']
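The IMDB dataset ships with 25,000 labeled reviews for training and 25,000 for testing, so fine-tuning on a CPU can take a while. If you just want to verify the pipeline end to end, you can optionally work with a small random subset; the sizes below are arbitrary choices for illustration.

# Optional: subsample for a quick experiment (sizes are arbitrary)
train_dataset = train_dataset.shuffle(seed=42).select(range(2000))
test_dataset = test_dataset.shuffle(seed=42).select(range(500))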

Step 4: Tokenize the Dataset

We need to tokenize the text data so that it can be fed into the TinyBERT model.

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

# Tokenize the dataset
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
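It can help to inspect one tokenized example to confirm that the new fields were added alongside the original ones. This is purely an optional check.

# Inspect one tokenized example (sanity check)
example = train_dataset[0]
print(example.keys())             # 'text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'
print(len(example['input_ids']))  # 128, because of padding/truncation to max_length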

Step 5: Prepare Data Loaders

We’ll feed the data to the model in batches using PyTorch’s DataLoader class, together with a small collate function that stacks the tokenized fields and labels into tensors.

import torch
from torch.utils.data import DataLoader

# Collate function: convert a list of examples into a batch of tensors
def data_collator(data):
    return {
        'input_ids': torch.tensor([f['input_ids'] for f in data], dtype=torch.long),
        'attention_mask': torch.tensor([f['attention_mask'] for f in data], dtype=torch.long),
        'labels': torch.tensor([f['label'] for f in data], dtype=torch.long)
    }

# Create data loaders
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=data_collator)
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=data_collator)
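Before training, you can optionally pull a single batch to verify that the collator produces correctly shaped tensors; the batch size of 16 and sequence length of 128 come from the settings above.

# Peek at one batch to verify tensor shapes (optional)
batch = next(iter(train_dataloader))
print(batch['input_ids'].shape)       # torch.Size([16, 128])
print(batch['attention_mask'].shape)  # torch.Size([16, 128])
print(batch['labels'].shape)          # torch.Size([16])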

Step 6: Train the Model

Now, we’ll define a training loop to fine-tune the TinyBERT model on our dataset.

import torch
from torch.optim import AdamW
from tqdm import tqdm

# Set device (GPU or CPU)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Define optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
model.train()
for epoch in range(3):  # Train for 3 epochs
    loop = tqdm(train_dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        # Update progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
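After fine-tuning, you will likely want to persist the weights so they can be reloaded later without retraining. A minimal sketch using the standard save_pretrained method; the output directory name tinybert-imdb is just an example.

# Save the fine-tuned model and tokenizer (directory name is arbitrary)
model.save_pretrained('tinybert-imdb')
tokenizer.save_pretrained('tinybert-imdb')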

Step 7: Evaluate the Model

After training, we need to evaluate the model’s performance on the test dataset.

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in test_dataloader:
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = model(**batch)
        logits = outputs.logits
        
        # Calculate accuracy
        predictions = torch.argmax(logits, dim=-1)
        correct += (predictions == batch['labels']).sum().item()
        total += len(batch['labels'])

accuracy = correct / total
print(f'Test Accuracy: {accuracy:.4f}')
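To classify new text with the fine-tuned model, tokenize it the same way and take the argmax of the logits. A minimal inference sketch; the example review is made up, and the label mapping (0 = negative, 1 = positive) follows how the IMDB dataset encodes its labels.

# Run inference on a new review (illustrative example)
model.eval()
text = 'A genuinely moving film with outstanding performances.'
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()
print('positive' if prediction == 1 else 'negative')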

Conclusion

In this tutorial, we’ve shown how to use TinyBERT_General_4L_312D for text classification. We’ve gone through loading the model and tokenizer, preparing the dataset, training the model, and evaluating its performance. TinyBERT offers a lightweight yet effective alternative to the original BERT model, making it suitable for deployment in resource-constrained environments.
