A Natural Language Processing project for Persian text classification, using information gain feature selection and unigram and bigram language models.
This project implements text classification algorithms for Persian news articles using different approaches:
- Information Gain Feature Selection: Calculates the most informative words for classification
- Unigram Model: Basic word-based classification without feature selection
- Unigram with Feature Selection: Enhanced classification using top 200 most informative words
- Bigram Model: Word pair-based classification for better context understanding
- Multi-class Classification: Supports multiple document categories (politics, sports, economy, social, arts)
- Information Gain Calculation: Identifies the most discriminative words for classification
- Stop Word Removal: Preprocesses text by removing common stop words
- Multiple Smoothing Parameters: Tests different delta values (0.1, 0.3, 0.5) for model optimization
- Confusion Matrix Generation: Provides detailed performance metrics
- Persian Text Support: Specifically designed for Persian language processing
```
nlp-information-gain/
├── calculate_information_gain.py        # Information gain calculation
├── unigram.py                           # Basic unigram classification
├── unigram_with_feature_selection.py    # Unigram with feature selection
├── bigram.py                            # Bigram classification model
├── removeStopWords.py                   # Stop word removal preprocessing
├── get200MaxInformation.python.py       # Extract top 200 informative words
├── HAM-Train.txt                        # Training dataset
├── HAM-Test.txt                         # Test dataset
├── puredInput.txt                       # Preprocessed training data
├── Stop_words.txt                       # Persian stop words list
├── information_gain_calculated.txt      # Calculated information gain scores
└── Probably not matter/                 # Additional output files
```
Install the required Python packages:

```
pip install nltk hazm numpy
```
The input files should follow this format:

```
category@@@@@@@@@@ document_text
```

Example (Persian category labels and text; the two lines read "politics: political news text" and "sports: sports news text"):

```
سیاسی@@@@@@@@@@ متن خبر سیاسی
ورزش@@@@@@@@@@ متن خبر ورزشی
```
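For reference, a line in this format can be split on the delimiter as follows. This is a minimal sketch; `parse_line` is an illustrative helper, not part of the project's code:

```python
def parse_line(line):
    """Split one dataset line into (category, document_text)."""
    category, _, text = line.partition("@@@@@@@@@@")
    return category.strip(), text.strip()

# Example: iterate over the training set.
with open("HAM-Train.txt", encoding="utf-8") as f:
    for line in f:
        category, text = parse_line(line)
        # ... feed (category, text) into training
```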
- Preprocess the data (remove stop words; see the sketch after this list): `python removeStopWords.py`
- Calculate information gain: `python calculate_information_gain.py`
- Run unigram classification: `python unigram.py`
- Run unigram with feature selection: `python unigram_with_feature_selection.py`
- Run bigram classification: `python bigram.py`
- Extract the top 200 words (optional): `python get200MaxInformation.python.py`
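As a rough sketch of what the preprocessing step does (file names are taken from the project structure above; the actual logic in `removeStopWords.py` may differ, e.g. it reportedly uses NLTK tokenization rather than whitespace splitting):

```python
def load_stop_words(path="Stop_words.txt"):
    """Read the Persian stop word list, one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def remove_stop_words(text, stop_words):
    """Drop stop words; plain whitespace tokenization for simplicity."""
    return " ".join(w for w in text.split() if w not in stop_words)

stop_words = load_stop_words()
with open("HAM-Train.txt", encoding="utf-8") as fin, \
     open("puredInput.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        category, _, text = line.partition("@@@@@@@@@@")
        cleaned = remove_stop_words(text.strip(), stop_words)
        fout.write(f"{category.strip()}@@@@@@@@@@ {cleaned}\n")
```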
`calculate_information_gain.py`:
- Calculates entropy-based information gain for each word
- Identifies the words that best distinguish between document categories
- Outputs a list of words sorted by information gain score
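A minimal sketch of this entropy-based computation (the document representation and helper names are illustrative; see `calculate_information_gain.py` for the project's actual implementation):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (log base 2) of a category-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def information_gain(word, docs):
    """IG(word) = H(C) - [P(w) * H(C|w) + P(not w) * H(C|not w)].

    docs: list of (category, set_of_words) pairs.
    """
    overall = Counter(cat for cat, _ in docs)
    with_word = Counter(cat for cat, words in docs if word in words)
    without_word = overall - with_word
    p_word = sum(with_word.values()) / len(docs)
    return entropy(overall) - (p_word * entropy(with_word)
                               + (1 - p_word) * entropy(without_word))
```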
`unigram.py`:
- Uses individual word frequencies for classification
- Applies Laplace smoothing with configurable delta values
- Computes log-likelihood scores for each document category
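A sketch of the smoothed scoring described above (class priors are omitted for brevity; names are illustrative and `unigram.py` may differ in detail):

```python
import math
from collections import Counter, defaultdict

def train_unigram(docs):
    """docs: list of (category, token_list). Returns per-class counts and vocab."""
    class_counts = defaultdict(Counter)
    for cat, tokens in docs:
        class_counts[cat].update(tokens)
    vocab = {w for counts in class_counts.values() for w in counts}
    return class_counts, vocab

def log_likelihood(tokens, cat, class_counts, vocab, delta=0.1):
    """Sum of log P(w | cat) with add-delta (Laplace) smoothing."""
    counts = class_counts[cat]
    denom = sum(counts.values()) + delta * len(vocab)
    return sum(math.log((counts[w] + delta) / denom) for w in tokens)

def classify(tokens, class_counts, vocab, delta=0.1):
    """Pick the category with the highest smoothed log-likelihood."""
    return max(class_counts,
               key=lambda c: log_likelihood(tokens, c, class_counts, vocab, delta))
```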
`bigram.py`:
- Uses word-pair frequencies for richer context
- Applies background-probability smoothing
- Takes word-sequence patterns into account during classification
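One common form of background smoothing interpolates the bigram estimate with a unigram "background" distribution; the sketch below assumes that form (the interpolation weight `lam` and the exact smoothing used in `bigram.py` may differ):

```python
import math
from collections import Counter, defaultdict

def train_bigram(docs):
    """docs: list of (category, token_list). Count bigrams and unigrams per class."""
    bi, uni = defaultdict(Counter), defaultdict(Counter)
    for cat, tokens in docs:
        uni[cat].update(tokens)
        bi[cat].update(zip(tokens, tokens[1:]))
    return bi, uni

def bigram_log_likelihood(tokens, cat, bi, uni, lam=0.5):
    """Interpolate P(w2 | w1) with a smoothed unigram background P(w2)."""
    total = sum(uni[cat].values())
    logp = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p_bg = (uni[cat][w2] + 1) / (total + len(uni[cat]))     # always > 0
        p_bi = bi[cat][(w1, w2)] / uni[cat][w1] if uni[cat][w1] else 0.0
        logp += math.log((1 - lam) * p_bi + lam * p_bg)
    return logp
```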
`unigram_with_feature_selection.py`:
- Selects the top 200 words with the highest information gain
- Reduces the feature space for improved performance
- Maintains classification accuracy with fewer features
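A sketch of how the selected features might be read back and applied; the line format of `information_gain_calculated.txt` is assumed to be "word score", so check the file before reusing this:

```python
def top_k_features(path="information_gain_calculated.txt", k=200):
    """Return the k words with the highest information gain scores."""
    scored = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                scored.append((parts[0], float(parts[-1])))
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return {word for word, _ in scored[:k]}

features = top_k_features()
# Restrict a tokenized document to the selected vocabulary before training:
# filtered = [w for w in tokens if w in features]
```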
The models generate confusion matrices and calculate:
- True Positives (TP)
- False Positives (FP)
- False Negatives (FN)
- Overall accuracy for each delta value
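A sketch of how these metrics can be derived from a confusion matrix (helper names are illustrative):

```python
from collections import defaultdict

def confusion_matrix(gold, predicted):
    """matrix[(gold_label, predicted_label)] = count."""
    matrix = defaultdict(int)
    for g, p in zip(gold, predicted):
        matrix[(g, p)] += 1
    return matrix

def per_class_metrics(matrix, categories):
    """TP/FP/FN per category, read directly off the matrix."""
    for c in categories:
        tp = matrix[(c, c)]
        fp = sum(matrix[(g, c)] for g in categories if g != c)
        fn = sum(matrix[(c, p)] for p in categories if p != c)
        print(f"{c}: TP={tp} FP={fp} FN={fn}")

def accuracy(matrix, categories):
    """Share of documents on the matrix diagonal."""
    correct = sum(matrix[(c, c)] for c in categories)
    return correct / sum(matrix.values())
```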
Smoothing:
- Delta values: [0.1, 0.3, 0.5]
- Adjustable in each model file
- Affect model performance and generalization
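Putting the pieces together, a sweep over the delta values could look like this (reusing the illustrative `classify`, `confusion_matrix`, and `accuracy` helpers sketched above; `test_docs`, `class_counts`, `vocab`, and `categories` are assumed to be loaded already):

```python
for delta in [0.1, 0.3, 0.5]:
    predicted = [classify(tokens, class_counts, vocab, delta)
                 for _, tokens in test_docs]
    gold = [cat for cat, _ in test_docs]
    matrix = confusion_matrix(gold, predicted)
    print(f"delta={delta}: accuracy={accuracy(matrix, categories):.3f}")
```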
Feature selection:
- Number of top words: 200 (configurable)
- Based on information gain scores
- Can be modified in `get200MaxInformation.python.py`
Output files:
- `information_gain_calculated.txt`: word scores sorted by information gain
- `puredInput.txt`: preprocessed training data without stop words
- Various confusion matrix and performance files in the `Probably not matter/` directory
- NLTK: Natural language processing and tokenization
- Hazm: Persian text processing
- NumPy: Numerical computations
- Collections: Data structures for frequency counting
- Entropy Calculation: Uses log base 2 for information gain
- Smoothing: Laplace smoothing with configurable parameters
- Tokenization: NLTK word tokenization for Persian text
- Frequency Counting: Efficient per-category counting using `defaultdict`
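For illustration, the tokenize-and-count pattern these details describe might look like the following (NLTK's `word_tokenize` needs the `punkt` model downloaded first; the project's own counting code may be organized differently):

```python
from collections import defaultdict
from nltk.tokenize import word_tokenize  # requires: nltk.download("punkt")

def count_frequencies(lines):
    """Per-category word frequencies via nested defaultdicts."""
    freqs = defaultdict(lambda: defaultdict(int))
    for line in lines:
        category, _, text = line.partition("@@@@@@@@@@")
        for word in word_tokenize(text):
            freqs[category.strip()][word] += 1
    return freqs
```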
This project appears to be an academic assignment (HW1) focused on:
- Information theory in NLP
- Feature selection methods
- Text classification algorithms
- Persian language processing
This is an academic project, but suggestions and improvements are welcome. Please ensure any modifications maintain the project's educational value and Persian language support.
This project is for educational purposes. Please respect academic integrity guidelines when using this code.