A project by Mohan Krishna G R, AI/ML Intern @ Infosys Springboard, Summer 2024.
- Problem Statement
- Project Statement
- Approach to Solution
- Background Research
- Solution
- Workflow
- Data Collection
- Abstractive Text Summarization
- Extractive Text Summarization
- Testing
- Deployment
- Containerization
- CI/CD Pipeline
 
- Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.
 - This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.
 - The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.
 
- Text Summarization focuses on converting large bodies of text into a few sentences summing up the gist of the larger text.
 - There is a wide variety of applications for text summarization including News Summary, Customer Reviews, Research Papers, etc.
 - This project aims to understand the importance of text summarization and apply different techniques to fulfill the purpose.
 
- Figure: Intended Plan
 
- Literature Review
 
- Selected Deep Learning Architecture
 
- Workflow for Abstractive Text Summarizer:
 
- Workflow for Extractive Text Summarizer:
 
- Data collection and preprocessing implemented in `src/data_preprocessing`.
- Data collected from different sources:
 - CNN/DailyMail: News
 - BillSum: Legal
 - arXiv: Scientific
 - DialogSum: Conversations
 
 - Data integration ensures robust, multi-objective data covering news articles, legal documents (acts and judgements), scientific papers, and conversations.
 - Validated the data through descriptive statistics and Exploratory Data Analysis (EDA), with frequency plots for every data source.
 - Data cleansing optimized for NLP tasks: removal of null records, lowercasing, punctuation removal, stop-word removal, and lemmatization.
 - Data splitting with scikit-learn into training, validation, and test sets, saved in CSV format.
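A minimal sketch of the cleaning-and-splitting step (the tiny in-memory dataset and the 80/10/10 ratio here are stand-ins; the real records come from the integrated sources above):

```python
import os
import re
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split

def clean_text(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (stop-word removal
    and lemmatization are omitted here for brevity)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical integrated dataset with 'article'/'summary' columns.
df = pd.DataFrame({
    "article": [f"Sample article number {i}, with Punctuation!" for i in range(10)],
    "summary": [f"Summary {i}." for i in range(10)],
}).dropna()
df["article"] = df["article"].map(clean_text)

# 80/10/10 train/validation/test split, saved as CSV.
train, rest = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
outdir = tempfile.mkdtemp()
for name, part in [("train", train), ("val", val), ("test", test)]:
    part.to_csv(os.path.join(outdir, f"{name}.csv"), index=False)
```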
 
- Training:
 - Selected a transformer architecture for abstractive summarization: fine-tuning a pre-trained model.
 - Chose Facebook's BART-large model for its performance metrics and efficient trainable-parameter count.
 - 406,291,456 trainable parameters.
 
 
 
- Methods:
- Native PyTorch Implementation
 - Trainer API Implementation
 
 
- Trained the model with manual training and evaluation loops in PyTorch. Implemented in `src/model.ipynb`.
 - Model evaluation: source code in `src/evaluation.ipynb`.
 - Obtained inconsistent results at inference.
 - ROUGE-1 (F-measure) = 0.018
 - A suspected tensor error during training with method 1 may account for the inconsistent model output.
 - Rejected for further deployment; an alternative approach was needed.
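The manual loops in `src/model.ipynb` follow the usual PyTorch pattern; a runnable sketch of that pattern, using a stand-in linear model rather than BART so it stays self-contained, looks like:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in regression model instead of BART; the loop structure
# (train step, then eval step) is the part being illustrated.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X, y = torch.randn(32, 4), torch.randn(32, 1)
train_loader = DataLoader(TensorDataset(X, y), batch_size=8)

for epoch in range(2):
    model.train()
    for xb, yb in train_loader:          # manual training loop
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                # manual evaluation loop
        eval_loss = loss_fn(model(X), y).item()
    print(f"epoch {epoch}: eval loss {eval_loss:.4f}")
```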
 
 
- Utilized the Trainer API from Hugging Face for optimized transformer model training. Implemented in `src/bart.ipynb`.
 - The model was trained on the whole dataset for 10 epochs, taking 26:24:22 (HH:MM:SS) over 125,420 steps.
 
 - Evaluation: performance measured with ROUGE scores. Source code: `src/rouge.ipynb`.
 - Method 2's results outperformed method 1's.
 - ROUGE-1 (F-measure) = 61.32 -> benchmark grade
 - Significantly higher than typical scores reported for state-of-the-art models on common datasets.
 - For reference, GPT-4's ROUGE-1 (F-measure) for text summarization is 63.22.
 - Selected for further deployment.
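ROUGE-1 rewards unigram overlap between a candidate summary and a reference. A simplified, dependency-free version of the F-measure (the notebook itself uses a ROUGE library, which also applies stemming) is:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure: unigram overlap between candidate
    and reference tokens (no stemming)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("the cat sat on the mat",
                     "the cat lay on the mat"), 4))  # -> 0.8333
```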
 
 - Comparative analysis showed significant improvement in performance after fine-tuning. Source code: `src/compare.ipynb`.
- Rather than a computationally intensive deep-learning model, a rule-based approach yields an efficient solution here. Used a novel approach combining the TF-IDF matrix with KMeans clustering.
 - This extends topic modeling to multiple lower-level specialized entities (i.e., groups) embedded in a single document, operating at both the individual-document and cluster level.
 - The sentence closest to each centroid (by Euclidean distance) is selected as the representative sentence for that cluster.
 - Implementation: preprocess the text, extract features with TF-IDF, and summarize by selecting representative sentences.
 - Source code for implementation & evaluation: `src/Extractive_Summarization.ipynb`
 - ROUGE-1 (F-measure) = 24.71
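A compact sketch of the TF-IDF + KMeans pipeline described above (sentence splitting and preprocessing are simplified, and the cluster count is a free parameter, not the notebook's exact choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences, n_clusters=2):
    """Vectorize sentences with TF-IDF, cluster with KMeans, then pick the
    sentence closest to each centroid (Euclidean distance) as its
    representative."""
    tfidf = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(tfidf)
    picks = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(tfidf[idx] - km.cluster_centers_[c], axis=1)
        picks.append(idx[np.argmin(dists)])
    return [sentences[i] for i in sorted(picks)]  # keep document order

sentences = [
    "The economy grew by three percent this quarter.",
    "Growth was driven largely by the services sector.",
    "Meanwhile, the football season opened on Saturday.",
    "The home team won its opening match.",
]
summary = extractive_summary(sentences)
print(summary)
```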
 
 
- Implemented a text summarization application with the Gradio library for a web-based interface, used to test the model's inference.
 - Source code: `src/interface.ipynb`
- File structure: `summarize/`
- Developed with the FastAPI framework, handling URLs, files, and direct text input.
 - Source code: `summarizer/app.py`
 - Endpoints:
- Root Endpoint
 - Summarize URL
 - Summarize File
 - Summarize Text
 
 
- Extracts text from various sources (URLs, PDF, DOCX) using BeautifulSoup and fitz (PyMuPDF).
 - Source code: `summarizer/extractors.py`
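A sketch of such extractors (function names are illustrative, and `fitz` is imported lazily so the HTML path works without PyMuPDF installed):

```python
from bs4 import BeautifulSoup

def text_from_html(html: str) -> str:
    """Strip scripts/styles and return the visible text of a page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(" ").split())

def text_from_pdf(path: str) -> str:
    """Extract text page-by-page with PyMuPDF."""
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

html = "<html><body><h1>Hi</h1><script>x()</script><p>there</p></body></html>"
print(text_from_html(html))  # -> Hi there
```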
- Implemented the extractive summarizer module, as in `src/bart.ipynb`.
 - Source code: `summarizer/extractive_summary.py`
- Developed a user-friendly interface using HTML, CSS, and JavaScript.
 - Source code: `summarizer/templates/index.html`
- Developed a Dockerfile to build a Docker image for the FastAPI application.
 - Source code: `summarizer/Dockerfile`
 - Image: `mohankrishnagr/infosys_text-summarization` on Docker Hub
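A typical Dockerfile for a FastAPI app of this shape looks roughly like the following; the base image, file names, and `app:app` module path are assumptions, not the repo's exact Dockerfile:

```dockerfile
FROM python:3.10-slim
WORKDIR /app

# Install dependencies first so Docker layer caching can reuse them.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```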
 
- Developed a CI/CD pipeline using Docker, Azure, and GitHub Actions.
 - Utilized Azure Container Instances (ACI) for deploying the image; the pipeline triggers on every push to the main branch.
 - Source code:
 - `.github/workflows/main.yml` (AWS)
 - `.github/workflows/azure.yml` (Azure)
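The build-and-push part of such a workflow can be sketched as below; the secret names and build context are assumptions, and the actual steps live in `.github/workflows/azure.yml`:

```yaml
name: Build and Deploy
on:
  push:
    branches: [main]   # pipeline triggers on every push to main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - name: Build and push image
        run: |
          docker build -t mohankrishnagr/infosys_text-summarization:final .
          docker push mohankrishnagr/infosys_text-summarization:final
      # A deployment step to Azure Container Instances would follow here.
```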
 - To use the Docker image, run:

       docker pull mohankrishnagr/infosys_text-summarization:final
       docker run -p 8000:8000 mohankrishnagr/infosys_text-summarization:final

   Then check it out at:
   http://localhost:8000/
 - Public IPv4 (AWS):
   http://54.168.82.95/
 - Public IPv4 (Azure):
   http://20.219.203.134:8000/
 - FQDN (Azure):
   http://mohankrishnagr.centralindia.azurecontainer.io:8000/
- Screenshots:
 
Thank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.
