Deep Learning project implementing a many-to-one LSTM architecture for sentiment classification of movie reviews.
- Source: Kaggle - IMDB Dataset of 50K Movie Reviews
 - Original Size: 50,000 reviews
 - Cleaned Size: 49,578 reviews (422 duplicates removed)
 - Classes: Binary (Positive/Negative)
- Balance: 1.01:1 (50.18% positive, 49.82% negative), nearly perfectly balanced
 
- Visit Kaggle - IMDB Dataset
- Download `IMDB Dataset.csv`
- Place it in the `data/` folder as `imdb_dataset.csv`
```bash
# Install Kaggle CLI
pip install kaggle

# Download and place the dataset
kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
unzip imdb-dataset-of-50k-movie-reviews.zip -d data/
mv data/IMDB\ Dataset.csv data/imdb_dataset.csv
```

- Model: Many-to-one LSTM
- Layers: Embedding → LSTM → Dropout → Dense (sigmoid)
- Parameters: 1.4M trainable
- Gating: forget, input, and output gates manage the cell state (standard cell equations below)
- Learning: captures long-term dependencies that carry sentiment nuance
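
For reference, these are the standard LSTM cell equations behind those gates, with $\sigma$ the sigmoid and $\odot$ element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{input gate} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{output gate} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{candidate cell state} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state update} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
$$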
 
- Python: 3.12 (stable version)
 - Deep Learning: TensorFlow 2.20.0
 - Data Processing: NumPy 1.26.x, Pandas 2.3.3
 - Visualization: Matplotlib 3.10.7, Seaborn 0.13.2, WordCloud 1.9.4
 - NLP: NLTK 3.9.2
 - ML Utils: scikit-learn 1.7.2
 
```text
imdb-sentiment-lstm/
├── .venv/                               # Virtual environment
├── data/
│   ├── imdb_dataset.csv                 # Original dataset (50K reviews)
│   ├── imdb_dataset_formatted.csv       # HTML tags removed
│   ├── imdb_dataset_cleaned.csv         # Final cleaned (49,578 reviews)
│   ├── X_train_preprocessed.npy         # Preprocessed training sequences
│   ├── X_val_preprocessed.npy           # Preprocessed validation sequences
│   ├── y_train.npy                      # Training labels
│   └── y_val.npy                        # Validation labels
├── doc/
│   ├── imbd_sentiment_analysis_project_presentation_d18zgx_vadasz_csaba.pptx   # Hungarian presentation
│   └── imbd_sentiment_analysis_project_documentation_d18zgx_vadasz_csaba.pdf   # Hungarian documentation
├── models/
│   ├── tokenizer.pickle                 # Keras tokenizer (vocab: 10K)
│   └── lstm_sentiment_model.h5          # Trained model
├── notebooks/                           # Jupyter notebooks for experiments
├── visualizations/
│   ├── eda/                             # Exploratory Data Analysis plots (7)
│   ├── preprocessing/                   # Preprocessing visualizations (2)
│   └── training/                        # Training history plots & model architecture
├── src/
│   ├── __init__.py
│   ├── check_versions.py                # PyPI version checker
│   ├── config.py                        # Configuration & hyperparameters
│   ├── data_clean.py                    # Data cleaning & EDA
│   ├── data_inspect.py                  # Initial data inspection
│   ├── data_format.py                   # HTML tag removal
│   ├── data_loader.py                   # Data loading & train/val split
│   ├── data_preprocess.py               # Tokenization & padding
│   └── model.py                         # LSTM model architecture
├── .gitignore
├── img.png                              # Self-generated AI image (DALL-E 3)
├── LICENSE                              # MIT License
├── main.py                              # Main entry point
├── README.md
└── requirements.txt                     # Packages to install, with versions
```
```bash
git clone <your-repo-url>
cd imdb-sentiment-lstm
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv .venv
.venv\Scripts\activate

# Mac/Linux
python -m venv .venv
source .venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

If prompted to update pip:

```bash
python -m pip install --upgrade pip
```

Explore the raw dataset structure and basic statistics.
```bash
python src/data_inspect.py
```

Output:

- Dataset info (50,000 rows, 2 columns)
- First 5 samples
- Sentiment distribution
- HTML tag detection
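
The script boils down to a few pandas checks. A minimal sketch (column names `review` and `sentiment` follow the Kaggle CSV; the exact checks in `src/data_inspect.py` may differ):

```python
import pandas as pd

df = pd.read_csv("data/imdb_dataset.csv")

df.info()                                        # 50,000 rows, 2 columns
print(df.head())                                 # first 5 samples
print(df["sentiment"].value_counts())            # sentiment distribution
print(df["review"].str.contains("<br").mean())   # share of reviews containing HTML tags
```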
 
Remove HTML tags and format the text for analysis.

```bash
python src/data_format.py
```

Output:

- Cleaned reviews (HTML tags removed)
- Saved to `data/imdb_dataset_formatted.csv`
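
A minimal sketch of this step, assuming a simple regex is enough to strip tags such as `<br />` (the actual logic lives in `src/data_format.py` and may differ):

```python
import pandas as pd

df = pd.read_csv("data/imdb_dataset.csv")

# Drop anything that looks like an HTML tag, then collapse leftover whitespace
df["review"] = df["review"].str.replace(r"<[^>]+>", " ", regex=True)
df["review"] = df["review"].str.replace(r"\s+", " ", regex=True).str.strip()

df.to_csv("data/imdb_dataset_formatted.csv", index=False)
```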
Comprehensive data cleaning and exploratory analysis.

```bash
python src/data_clean.py
```

What it does:

- Missing values check: 0 missing values found
- Duplicate removal: 422 duplicates removed (0.84%)
- Sentiment validation: 0 invalid values found
- Text length analysis: character & word counts
- Outlier detection: IQR method (7.39% outliers kept; see the sketch after the statistics below)
- Descriptive statistics: mean, median, std, min, max
- 7 visualizations created:
  - Sentiment distribution (bar chart)
  - Text length histogram (word & character count)
  - Text length boxplot (by sentiment)
  - Word clouds (positive & negative)
  - Top 20 frequent words (positive & negative)

Output:

- `data/imdb_dataset_cleaned.csv` (49,578 reviews)
- 7 PNG visualizations in `visualizations/eda/`
Key Statistics:

```text
Total Reviews:     49,578
Positive:          24,882 (50.18%)
Negative:          24,696 (49.82%)
Avg Word Count:    229 words
Median Word Count: 172 words
```
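
A minimal sketch of the deduplication and IQR outlier check, assuming outliers are flagged on word counts with the usual 1.5 × IQR fences (thresholds and column names in `src/data_clean.py` may differ):

```python
import pandas as pd

df = pd.read_csv("data/imdb_dataset_formatted.csv")
df = df.drop_duplicates(subset="review")   # removes the 422 duplicate reviews

# IQR outlier detection on review length in words; outliers are reported, not dropped
word_counts = df["review"].str.split().str.len()
q1, q3 = word_counts.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (word_counts < q1 - 1.5 * iqr) | (word_counts > q3 + 1.5 * iqr)
print(f"Outliers: {outliers.mean():.2%} (kept, not removed)")

df.to_csv("data/imdb_dataset_cleaned.csv", index=False)
```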
Tokenization, sequence padding, and train/validation split.

```bash
python src/data_preprocess.py
```

What it does:

- Tokenization: convert text to integer sequences
- Vocabulary: top 10,000 most frequent words
- Padding: all sequences padded/truncated to 200 tokens
- Train/Val split: 80/20 stratified split (39,662 / 9,916; see the sketch after the statistics below)
- Save preprocessed data: arrays saved as .npy files (38.2 MB)
- 2 visualizations created:
  - Sequence length distribution (train & val)
  - Vocabulary statistics (Zipf's law)
 
Output:

- `models/tokenizer.pickle` (4.7 MB)
- `data/X_train_preprocessed.npy` (30.26 MB)
- `data/X_val_preprocessed.npy` (7.57 MB)
- `data/y_train.npy` (309 KB)
- `data/y_val.npy` (77 KB)
- 2 PNG visualizations in `visualizations/preprocessing/`
Key Statistics:

```text
Training Set:      39,662 samples (80%)
Validation Set:     9,916 samples (20%)
Vocabulary Size:   10,000 words
Sequence Length:   200 tokens (padded/truncated)
Padding:           58.9% padded, 40.8% truncated
```
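
A minimal sketch of this step using the legacy Keras `Tokenizer` API; the `oov_token` and `random_state` values are illustrative assumptions, not confirmed project settings:

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df = pd.read_csv("data/imdb_dataset_cleaned.csv")
labels = (df["sentiment"] == "positive").astype(int).values  # 1 = positive, 0 = negative

# 80/20 stratified split, done on the raw text before tokenization
X_train_txt, X_val_txt, y_train, y_val = train_test_split(
    df["review"], labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit the tokenizer on training text only, keeping the top 10,000 words
tokenizer = Tokenizer(num_words=10_000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train_txt)

# Convert text to integer sequences and pad/truncate everything to 200 tokens
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train_txt), maxlen=200)
X_val = pad_sequences(tokenizer.texts_to_sequences(X_val_txt), maxlen=200)

np.save("data/X_train_preprocessed.npy", X_train)
np.save("data/X_val_preprocessed.npy", X_val)
np.save("data/y_train.npy", y_train)
np.save("data/y_val.npy", y_val)
with open("models/tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)
```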
Build and compile the LSTM architecture.

```bash
python src/model.py
```

Architecture:

```text
Input (batch_size, 200)
    ↓
Embedding Layer (vocab_size=10K, embedding_dim=128)
    ↓
LSTM Layer (128 units, dropout=0.5, recurrent_dropout=0.2)
    ↓
Dropout Layer (0.5)
    ↓
Dense Output (1 unit, sigmoid activation)
    ↓
Output (batch_size, 1) - probability [0 = negative, 1 = positive]
```
Model Summary:

```text
Total Parameters:     1,411,713 (5.39 MB)
Trainable Parameters: 1,411,713
Layer Breakdown:
  - Embedding:        1,280,000 params
  - LSTM:               131,584 params
  - Dense:                  129 params
```
Output:

- `visualizations/training/model_architecture.json`
- `visualizations/training/model_config.json`
- `visualizations/training/model_architecture.png`
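
A minimal Keras sketch matching the summary above; the `adam` optimizer and binary cross-entropy loss are assumptions (natural choices for a sigmoid binary classifier), and the project's actual settings live in `src/config.py` and `src/model.py`:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000
EMBED_DIM = 128
SEQ_LEN = 200

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),               # 10,000 * 128 = 1,280,000 params
    layers.LSTM(128, dropout=0.5, recurrent_dropout=0.2),  # 4 * ((128 + 128 + 1) * 128) = 131,584 params
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),                 # 128 + 1 = 129 params
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

The per-layer comments reproduce the layer breakdown above, summing to 1,411,713 trainable parameters.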
- Train LSTM model (10 epochs, batch_size=64)
- Early stopping with patience=3
- Save training history & plots
- Save trained model (see the sketch below)
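
A minimal sketch of the planned training run, reusing `model` from the sketch above; `monitor="val_loss"` and `restore_best_weights=True` are assumptions:

```python
import numpy as np
import tensorflow as tf

X_train = np.load("data/X_train_preprocessed.npy")
y_train = np.load("data/y_train.npy")
X_val = np.load("data/X_val_preprocessed.npy")
y_val = np.load("data/y_val.npy")

# Stop when validation loss hasn't improved for 3 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=64,
    callbacks=[early_stop],
)
model.save("models/lstm_sentiment_model.h5")
```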
 
- Evaluate on validation set
- Confusion matrix
- Classification report
- Sample predictions (see the sketch below)
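
A minimal sketch of the planned evaluation, assuming `model`, `X_val`, and `y_val` from the earlier sketches and the usual 0.5 decision threshold:

```python
from sklearn.metrics import classification_report, confusion_matrix

probs = model.predict(X_val).ravel()   # sigmoid probabilities in [0, 1]
preds = (probs >= 0.5).astype(int)     # 1 = positive, 0 = negative

print(confusion_matrix(y_val, preds))
print(classification_report(y_val, preds, target_names=["negative", "positive"]))
```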
 
After cleaning, the dataset shows excellent characteristics for training:

- Near-perfect balance: 50.18% positive vs 49.82% negative (no resampling needed)
- Good text length distribution: average of 229 words, well suited to LSTM input
- Minimal duplicates: only 0.84% removed
- No missing data: 100% complete dataset
- Outliers kept: 7.39% unusually long/short reviews retained (they may carry valuable sentiment signal)

Check the visualizations in `visualizations/eda/` for detailed insights!
Created as part of Deep Learning coursework at University of Pannonia.
Developed with focus on understanding LSTM mechanisms and practical NLP implementation.
Developed by Csaba79-coder | Csaba Vadász
MIT License
Solution: Make sure you're in the virtual environment:

```bash
# Windows
.venv\Scripts\activate

# Mac/Linux
source .venv/bin/activate
```

Solution: We use NumPy 1.26.x (not 2.x) for TensorFlow compatibility. This is intentional and stable.
Solution: Install it separately if needed:

```bash
pip install wordcloud==1.9.4
```

Solution: Run scripts from the project root:

```bash
python src/data_clean.py        # correct
cd src && python data_clean.py  # wrong
```