Multimodal_Visual_Knowledge_Assistant is an AI system that combines computer vision and natural language processing to understand images and generate meaningful text responses. By pairing OpenAI's CLIP with GPT-2, the project analyzes images across four domains (Medical, Fashion, Microscopy, and Nature) and produces relevant insights or creative descriptions.
- 🔍 Image Classification using OpenAI’s CLIP model.
- 🧠 Text Generation using GPT-2 for dynamic, context-aware language.
- 📷 Visual display of images, predictions, and generated texts.
- 🔄 Supports domain-specific prompts for deeper semantic relevance.
- 💡 Easy to scale by adding new categories and label sets.
| Task | Tool / Library |
|---|---|
| Multimodal Encoding | CLIP (openai/clip-vit-base-patch32) |
| Language Generation | GPT-2 (gpt2) |
| Deep Learning Framework | PyTorch |
| Tokenization & Modeling | Hugging Face Transformers |
| Image Display & Parsing | Pillow, IPython.display |
| Category | Example Labels | Prompt for GPT-2 |
|---|---|---|
| Medical | MRI with tumor, Normal MRI, X-ray with fracture, CT scan | "This image appears to be a medical scan. Based on the content, we can infer:" |
| Fashion | Red dress, Man in suit, Runway fashion, Casual outfit | "This image appears to be a fashion photo. Here's a marketing copy:" |
| Microscopy | Cell structure, Bacteria culture, Virus, Tissue sample | "This is a microscopy image. Scientifically, this could imply:" |
| Nature | Mountain landscape, Forest, Beach, Ocean pollution | "This is a natural scene. Here's a creative caption or fact:" |
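The category table above maps naturally onto a single configuration dictionary, which is also what makes the project easy to extend with new label sets. The sketch below is illustrative (the `DOMAINS` name and structure are assumptions, not taken from the project source):

```python
# Domain-specific label sets and GPT-2 prompts, mirroring the table above.
# The DOMAINS name and layout are illustrative, not from the project source.
DOMAINS = {
    "Medical": {
        "labels": ["MRI with tumor", "Normal MRI", "X-ray with fracture", "CT scan"],
        "prompt": "This image appears to be a medical scan. Based on the content, we can infer:",
    },
    "Fashion": {
        "labels": ["Red dress", "Man in suit", "Runway fashion", "Casual outfit"],
        "prompt": "This image appears to be a fashion photo. Here's a marketing copy:",
    },
    "Microscopy": {
        "labels": ["Cell structure", "Bacteria culture", "Virus", "Tissue sample"],
        "prompt": "This is a microscopy image. Scientifically, this could imply:",
    },
    "Nature": {
        "labels": ["Mountain landscape", "Forest", "Beach", "Ocean pollution"],
        "prompt": "This is a natural scene. Here's a creative caption or fact:",
    },
}

# Adding a new category is just another dictionary entry.
```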
1. **Load Models** — load CLIP and GPT-2 from Hugging Face's model hub.
2. **Classify Image** — CLIP scores the image against a set of predefined domain-specific labels and predicts the most probable one.
3. **Generate Text** — GPT-2 generates a natural language output based on the predicted label and a domain-specific prompt.
4. **Display Results** — the notebook displays the original image, the predicted label with its confidence, and the generated descriptive text.
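The classify-then-generate steps above can be sketched with the Hugging Face Transformers API. This is a minimal outline under stated assumptions: the function names, sampling settings, and `example.jpg` path are illustrative, not the project's actual code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer


def classify_image(image, labels, clip_model, clip_processor):
    """Return the most probable label and its confidence for the image."""
    inputs = clip_processor(text=labels, images=image,
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = clip_model(**inputs)
    # logits_per_image has shape (1, num_labels); softmax gives probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    best = probs.argmax().item()
    return labels[best], probs[best].item()


def generate_text(prompt, label, gpt2_model, tokenizer, max_new_tokens=60):
    """Continue a domain prompt plus the predicted label with GPT-2."""
    input_ids = tokenizer.encode(f"{prompt} {label}.", return_tensors="pt")
    output = gpt2_model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Model loading (downloads weights on first run).
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    image = Image.open("example.jpg")  # hypothetical input image
    labels = ["Mountain landscape", "Forest", "Beach", "Ocean pollution"]
    label, confidence = classify_image(image, labels, clip_model, clip_processor)
    print(f"Predicted: {label} ({confidence:.1%})")
    print(generate_text("This is a natural scene. Here's a creative caption or fact:",
                        label, gpt2_model, tokenizer))
```

Model loading is kept under the `__main__` guard so that importing the module stays cheap; the first run still downloads roughly a gigabyte of weights.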
Make sure you have Python 3.8+ installed, then install the required dependencies:

```bash
pip install torch torchvision
pip install transformers
pip install pillow
```
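A quick sanity check, assuming the installs above succeeded, is to import each package and print its version:

```python
# Verify that the core dependencies are importable.
import torch
import transformers
import PIL

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("Pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())
```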
