A Step-by-Step Guide to Training Microsoft PHI-3 Vision-Language Model (VLLM)

Akshay Jadhav
8 min read · Oct 30, 2024


Training a Vision-Language Learning Model (VLLM) might sound intimidating at first, but with the right guidance and tools, it becomes an exciting and rewarding experience. In this detailed guide, I will walk you through my entire process of training Microsoft’s PHI-3 model, explaining each step with the code I used and how you can replicate it from scratch. Let’s dive in!

1. Introduction to Vision-Language Learning Models and PHI-3

Vision-Language Learning Models are designed to understand and generate natural language based on images, combining computer vision and natural language processing. Microsoft's PHI-3 Vision model (the Phi-3-vision-128k-instruct checkpoint used in this guide) is a powerful VLLM designed for complex tasks like answering questions about visual input. These models power a wide range of applications, such as automated image captioning, visual question answering, and accessibility tools for visually impaired users. This guide covers everything from gathering and preparing data to fine-tuning the model and testing it on a sample image, providing a comprehensive understanding of the entire process.

2. Setting Up Your Environment

Before we start with data gathering and model training, we need to set up our environment. We’ll use Google Colab for its free GPU access, which is essential for training such large models. Google Colab provides an easy-to-use interface with access to powerful GPUs and TPUs, making it an ideal choice for training deep learning models without requiring expensive hardware.
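Before installing anything, it helps to confirm that a GPU runtime is actually active (Runtime → Change runtime type → GPU). A quick check like the one below works because Colab typically ships with PyTorch preinstalled; the exact GPU you are assigned (often a T4 on the free tier) may vary:

import torch

# Verify that a CUDA-capable GPU is visible to PyTorch
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))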

Step 1: Install Necessary Packages

Start by installing all the required packages. Open a Google Colab notebook and run the following commands:

!pip install numpy pandas Pillow matplotlib scikit-learn tqdm datasets ipython huggingface_hub transformers peft torch seaborn bitsandbytes

These libraries cover everything you need for data processing, model training, and interaction with the Hugging Face Hub. If you have access to a powerful GPU such as an NVIDIA A100, you can optionally install `flash_attn` as well (see the install command just after this list). Each of these packages plays an important role:

- numpy and pandas: Essential for data manipulation and analysis.
- Pillow: Used for image processing.
- matplotlib and seaborn: Useful for data visualization.
- scikit-learn: Provides tools for splitting datasets and other machine learning utilities.
- tqdm: Displays progress bars for loops.
- datasets: For managing datasets, particularly from Hugging Face.
- huggingface_hub: To interact with Hugging Face for pushing and pulling datasets and models.
- transformers: The core library for working with state-of-the-art machine learning models, especially for NLP and VLLMs.
- peft: Parameter-efficient fine-tuning for optimizing model training.
- torch: The main library for creating and training deep learning models.
- bitsandbytes: For optimizing training on GPUs, especially when dealing with large models.
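If you do have a FlashAttention-capable GPU such as an A100, the optional `flash_attn` install mentioned above is a single extra command (it is not needed on the free Colab tier, and building it can take a while):

!pip install flash-attn --no-build-isolation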

3. Creating a Hugging Face Account

To store and retrieve your model and dataset efficiently, you will need a Hugging Face account.

- Go to [Hugging Face](https://huggingface.co/) and create an account.
- Once logged in, click on your profile and navigate to “Settings”.
- Under “Access Tokens”, create a new token with write access, which you’ll need later for pushing models and datasets.

Hugging Face offers a powerful platform for sharing machine learning models and datasets. It also provides an easy-to-use interface and API for managing these resources, making it an invaluable tool for machine learning practitioners. By creating an account and generating an access token, you can seamlessly integrate your work with the platform, allowing you to share your models with the community or simply store them for personal use.
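With the token in hand, you can authenticate directly from the notebook. A convenient option in Colab is the interactive prompt below, which avoids pasting the token into code; the token-based `login()` call used later in this guide works just as well:

from huggingface_hub import notebook_login

# Opens an interactive prompt where you can paste your write-access token
notebook_login()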

4. Data Gathering and Preparation

To train the model, we need a dataset consisting of images and their corresponding captions. This data forms the foundation of the VLLM, allowing the model to learn the association between visual content and textual descriptions.

Step 1: Gather Data

I gathered data from research papers that included both images and detailed captions. The dataset is crucial for training the model to understand complex visual scenes and generate meaningful responses. I saved all images in a folder and stored their captions in a JSON file. The JSON structure looked like this:

[
  {
    "id": "data/images/000779-Figure4-1.png",
    "image": "000779-Figure4-1.png",
    "caption": "Fig. 4 Analogy between the conventional glass transition …"
  },
  ...
]

Each entry contains the image path and its caption, which is essential for training the model. The captions provide the model with a textual description of what is depicted in the image, enabling it to learn the relationships between visual elements and language.
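Before moving on, a quick sanity check helps catch broken paths early. This is a small sketch that assumes the folder layout described above (images under data/images and captions in data/data.json):

import json
import os

with open("data/data.json") as f:
    entries = json.load(f)

# Report any caption entry whose image file is missing on disk
missing = [e["image"] for e in entries
           if not os.path.exists(os.path.join("data/images", e["image"]))]
print(f"{len(entries)} entries, {len(missing)} missing images")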

Step 2: Convert JSON to CSV

To make the dataset compatible with Hugging Face, I converted it into a CSV file. The CSV format is easier to work with when loading data into machine learning frameworks like PyTorch or TensorFlow. Here’s the code I used:


import os
import json
import csv
import shutil
from sklearn.model_selection import train_test_split

# Define paths
data_folder = "data"
images_folder = os.path.join(data_folder, "images")
json_file = os.path.join(data_folder, "data.json")
output_folder = "IdeficsData"
output_images_train_folder = os.path.join(output_folder, "images/train")
output_images_test_folder = os.path.join(output_folder, "images/test")

# Create necessary directories
os.makedirs(output_folder, exist_ok=True)
os.makedirs(output_images_train_folder, exist_ok=True)
os.makedirs(output_images_test_folder, exist_ok=True)

# Load the JSON data
with open(json_file, 'r') as f:
    data = json.load(f)

# Split data into train and test sets (90% train, 10% test)
train_data, test_data = train_test_split(data, test_size=0.1, random_state=42)

# Create a single CSV file
csv_file = os.path.join(output_folder, "qa_text.csv")

with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'query', 'answers'])  # CSV header

    def process_data(data_list, data_type, output_image_folder):
        # Write one CSV row per item and copy its image into the split folder
        for idx, item in enumerate(data_list):
            new_image_name = f"{data_type}_{idx}.png"
            query = "What is described in this image?"
            answer = item['caption']

            # Write to CSV
            writer.writerow([f"{data_type}_{idx}", query, answer])

            # Copy and rename image
            old_image_path = os.path.join(images_folder, item['image'])
            new_image_path = os.path.join(output_image_folder, new_image_name)
            shutil.copy(old_image_path, new_image_path)

    process_data(train_data, "train", output_images_train_folder)
    process_data(test_data, "test", output_images_test_folder)

print(f"Data has been processed and saved in '{output_folder}' with images renamed and a single CSV file created.")

This script converts the JSON data to CSV format, splits the data into training and test sets, and saves the images accordingly. By splitting the data into training and testing sets, we ensure that the model is validated on unseen data, which is crucial for assessing its generalization capabilities.

5. Uploading Data to Hugging Face

Once the data is ready, we upload it to Hugging Face so we can access it anytime for training. Hugging Face’s dataset hub allows you to store and share datasets, making it easy to collaborate with others or access your datasets across different projects.


from datasets import Dataset
from huggingface_hub import login
import pandas as pd
import os

train_images_directory = '/path/to/IdeficsData/images/train/'
test_images_directory = '/path/to/IdeficsData/images/test/'
qa_text = pd.read_csv('/path/to/IdeficsData/qa_text.csv')

# Prepare dataset dictionary (train_* ids live in the train folder, test_* ids in the test folder)
dataset_dict = {
    'id': qa_text['id'].tolist(),
    'image': [
        os.path.join(train_images_directory if item_id.startswith('train') else test_images_directory,
                     f"{item_id}.png")
        for item_id in qa_text['id']
    ],
    'query': qa_text['query'].tolist(),
    'answers': qa_text['answers'].tolist()
}

# Create the dataset
dataset = Dataset.from_dict(dataset_dict)

# Log in and push the dataset to Hugging Face
login(token='YOUR_API_TOKEN')
dataset.push_to_hub("your_hf_username/DSCDataSet")

Replace `YOUR_API_TOKEN` with the token you generated earlier. This will push your dataset to your Hugging Face account, allowing you to access it from anywhere. Uploading your dataset to Hugging Face not only makes it easily accessible but also allows you to share it with others in the community, fostering collaboration.
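Later, when you are ready to train, the same dataset can be pulled back down with `load_dataset`. One way to recover the train and evaluation splits expected by the Trainer in the next section is to filter on the id prefix created earlier (a sketch, assuming the dataset name used above):

from datasets import load_dataset

dataset = load_dataset("your_hf_username/DSCDataSet")["train"]
train_dataset = dataset.filter(lambda ex: ex["id"].startswith("train"))
eval_dataset = dataset.filter(lambda ex: ex["id"].startswith("test"))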

6. Training the PHI-3 Model

With the dataset uploaded, we can now train the PHI-3 model. Training involves fine-tuning the pre-trained PHI-3 model on our specific dataset so that it can learn to generate accurate captions for the images.

Step 1: Load the Model

We start by loading the pre-trained PHI-3 model and its processor. The processor handles both text and image inputs, preparing them for the model.


from transformers import AutoModelForCausalLM, AutoProcessor
import torch

base_model_id = 'microsoft/Phi-3-vision-128k-instruct'
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16, trust_remote_code=True).to('cuda:0')
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

Step 2: Configure LoRA for Efficient Training

LoRA (Low-Rank Adaptation) is used to train large models more efficiently by fine-tuning specific layers. LoRA reduces the number of trainable parameters, making the process more computationally feasible while retaining the model’s ability to learn effectively.


from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    bias="none",
    lora_dropout=0.1,
    target_modules=["gate_up_proj", "down_proj"],
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Step 3: Define the Data Collator and Training Arguments

Next, we define the data collator, which turns each batch of raw examples into model-ready tensors during training (a sketch of one possible collator follows the training arguments below), and set the training arguments. The training arguments cover key hyperparameters such as the learning rate, number of epochs, and batch size, along with other settings that control the training process.


from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="PHI3VModel",
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    bf16=True,
    save_total_limit=50
)
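The collator itself is not shown above, so here is a minimal sketch of what one could look like. It assumes the dataset columns created earlier (`image` holding a local file path, `query`, and `answers`), a batch size of 1, and the `<|image_1|>` placeholder that the Phi-3-Vision processor expects; it is an illustration rather than the exact collator used for this project. If your collator reads raw dataset columns like this, you may also need `remove_unused_columns=False` in the training arguments so the Trainer does not drop them:

from PIL import Image

class Phi3VisionCollator:
    # Example collator: builds one supervised example per batch item and
    # masks the prompt tokens so the loss is computed only on the caption.
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        example = examples[0]  # per_device_train_batch_size=1
        image = Image.open(example["image"]).convert("RGB")

        messages = [{"role": "user", "content": f"<|image_1|>\n{example['query']}"}]
        prompt = self.processor.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        answer = example["answers"] + self.processor.tokenizer.eos_token

        # Tokenize prompt + answer together, then hide the prompt from the loss
        batch = self.processor(prompt + answer, [image], return_tensors="pt")
        prompt_len = self.processor(prompt, [image], return_tensors="pt")["input_ids"].shape[1]
        labels = batch["input_ids"].clone()
        labels[:, :prompt_len] = -100  # -100 is ignored by the loss
        batch["labels"] = labels
        return batch

collator = Phi3VisionCollator(processor)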

Step 4: Start Training

We use the `Trainer` class from the Transformers library to train the model. The `Trainer` class simplifies the training process by managing the optimization loop, evaluation, and saving checkpoints.


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator
)

trainer.train()

Finally, push the trained model to Hugging Face for later use:


model.push_to_hub("your_hf_username/MicrosoftPHI3")
processor.push_to_hub("your_hf_username/MicrosoftPHI3")

Pushing the model to Hugging Face allows you to easily share your trained model or use it for inference on different devices or environments.
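Because LoRA stores only the adapter weights, loading the fine-tuned model later means attaching the pushed adapter back onto the base model. A sketch of what that could look like (the repository name mirrors the push above):

from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel
import torch

base_model_id = 'microsoft/Phi-3-vision-128k-instruct'
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to('cuda:0')
model = PeftModel.from_pretrained(base_model, "your_hf_username/MicrosoftPHI3")
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)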

7. Inference and Testing the Model

Once the model is trained, it’s time to test it! Inference is the process of using the trained model to generate predictions, such as answering questions about an image.


from PIL import Image
import requests

# Load an image
demo_image_url = "https://data-mining.philippe-fournier-viger.com/wp-content/uploads/2013/07/chart.png"
image = Image.open(requests.get(demo_image_url, stream=True).raw)

# Generate a response
question = "What is described in this image?"
inputs = processor(question, image, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs, max_new_tokens=1000)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print("Model's response:", response)

This code snippet demonstrates how to use the trained model to generate answers based on an input image. You can replace the image URL with any other image to test the model’s capabilities. The model processes the image along with the question and generates a text response that describes the image.
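For reference, the Phi-3-Vision model card formats prompts through the tokenizer's chat template with an `<|image_1|>` placeholder and strips the prompt tokens before decoding, so only the newly generated answer is printed. A variant of the snippet above along those lines, reusing the same `model`, `processor`, and `image` objects:

messages = [{"role": "user", "content": "<|image_1|>\nWhat is described in this image?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, max_new_tokens=1000)
# Drop the prompt tokens so only the generated answer is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print("Model's response:", response)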

8. Conclusion

Training the Microsoft PHI-3 Vision-Language Model requires careful data preparation, efficient use of cloud infrastructure, and leveraging tools like LoRA to make the process manageable. By following this guide, you should have a clearer understanding of how to gather data, prepare it, train a VLLM, and test it effectively.

The beauty of VLLMs lies in their ability to understand the context of both visual and textual information, and with the PHI-3 model, we can create systems that provide meaningful insights from images. These models have a wide range of potential applications, from enhancing accessibility for visually impaired individuals to automating content generation and providing educational insights.

I hope this guide helps you embark on your VLLM journey! If you have any questions or run into issues, feel free to reach out. Machine learning is a field where collaboration and sharing knowledge are key, and I’d be more than happy to help you troubleshoot or offer advice. Happy training and exploring the limitless possibilities of Vision-Language Learning Models!

---

Next Steps:

1. Experiment with Different Datasets: Try training the PHI-3 model on different datasets to see how it performs on various types of visual content.
2. Fine-Tune Hyperparameters: Adjust the learning rate, batch size, and other hyperparameters to optimize model performance.
3. Deploy Your Model: Consider deploying your trained model as a web service using platforms like Hugging Face Spaces or Flask, making it accessible to others.

The world of VLLMs is vast, and there’s so much more to explore. This is just the beginning — keep experimenting and pushing the boundaries of what’s possible with machine learning and AI!
