This project implements various Natural Language Processing (NLP) techniques for document representation and clustering using Persian text data. The project explores different approaches to convert documents into vector representations and evaluates their effectiveness in document clustering tasks.
This project is structured as a comprehensive NLP assignment that implements four different document representation methods:
- Section 1: Word2Vec-based document representation with stop word removal
- Section 2: TF-IDF weighted Word2Vec document representation
- Section 3: Doc2Vec document representation with stop word removal
- Section 4: Latent Semantic Analysis (LSA) using SVD
The repository is organized as follows:

```
nlp-document-representation/
├── Data/
│   ├── HAM-Train.txt              # Training dataset
│   └── HAM-Test.txt               # Test dataset
├── HW2_97131099/
│   ├── HW2_97131099.docx          # Assignment document
│   └── HW2_97131099.pdf           # Assignment PDF
├── Untitled1.ipynb                # Section 1: Word2Vec + Stop Words
├── Untitled2.ipynb                # Section 2: TF-IDF + Word2Vec
├── Untitled3.ipynb                # Section 3: Doc2Vec + Stop Words
├── Untitled4.ipynb                # Section 4: LSA with SVD
├── Stop_words.txt                 # Persian stop words (1299 words)
├── Stop_words2.txt                # Additional stop words (188 words)
├── puredInput.txt                 # Preprocessed input data
├── puredInputSectionOne.txt       # Section 1 preprocessed data
├── puredInputSectionTwo.txt       # Section 2 preprocessed data
├── puredInputSectionThree.txt     # Section 3 preprocessed data
├── puredInputSectionFour.txt      # Section 4 preprocessed data
└── README.md                      # This file
```
The project uses the HAM (Hamshahri) dataset, which contains Persian news articles categorized into 5 classes:
- اقتصاد (Economy)
- سیاسی (Political)
- اجتماعی (Social)
- ادب و هنر (Literature and Art)
- ورزش (Sports)
- Training Set: 7,740 documents
- Test Set: 1,000 documents
- Total: 8,740 documents
- Classes: 5 categories
Section 1: Word2Vec + Stop Word Removal (Untitled1.ipynb)
- Method: Word2Vec (300 dimensions) + document averaging (a minimal sketch follows this list)
- Preprocessing: Stop word removal using Persian stop words
- Clustering: K-means (k=5)
- Features:
- Window size: 5
- Minimum count: 1
- Workers: 4
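A minimal sketch of this pipeline, assuming gensim ≥ 4 (`vector_size` keyword) and scikit-learn; `docs` is a hypothetical stand-in for the tokenized, stop-word-filtered corpus from `puredInputSectionOne.txt`, not the notebook's actual variable:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Placeholder for the tokenized, stop-word-filtered documents.
docs = [["نمونه", "متن"], ["سند", "دیگر"], ["خبر", "ورزشی"],
        ["گزارش", "اقتصادی"], ["مقاله", "هنری"]]

# Train Word2Vec with the parameters listed above.
model = Word2Vec(docs, vector_size=300, window=5, min_count=1, workers=4)

# Each document becomes the average of its word vectors.
def doc_vector(tokens):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X = np.vstack([doc_vector(d) for d in docs])

# Cluster the document vectors into the 5 HAM categories.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```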
Section 2: TF-IDF Weighted Word2Vec (Untitled2.ipynb)
- Method: Word2Vec + TF-IDF weighting (sketched after this list)
- Preprocessing: No stop word removal
- Weighting: TF-IDF scores for each word
- Clustering: K-means (k=5)
- Features:
- TF calculation: `1 + log10(term_frequency)`
- IDF calculation: `log10(total_documents / document_frequency)`
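A hedged sketch of the weighting step using the TF/IDF definitions above; the helper names are illustrative, not the notebook's identifiers. Each word vector is scaled by its TF-IDF weight, and the document vector is the weight-normalized sum:

```python
import math
from collections import Counter
import numpy as np
from gensim.models import Word2Vec

# Placeholder tokenized documents (no stop word removal in this section).
docs = [["نمونه", "متن"], ["سند", "دیگر", "متن"], ["خبر", "ورزشی"],
        ["گزارش", "اقتصادی"], ["مقاله", "هنری"]]

model = Word2Vec(docs, vector_size=300, window=5, min_count=1, workers=4)

N = len(docs)
df = Counter(w for d in docs for w in set(d))  # document frequency per word

def tfidf_doc_vector(tokens):
    tf = Counter(tokens)
    vec = np.zeros(model.wv.vector_size)
    weight_sum = 0.0
    for word, freq in tf.items():
        # TF = 1 + log10(term_frequency); IDF = log10(N / document_frequency)
        weight = (1 + math.log10(freq)) * math.log10(N / df[word])
        vec += weight * model.wv[word]   # min_count=1, so every token is in vocab
        weight_sum += weight
    return vec / weight_sum if weight_sum else vec

X = np.vstack([tfidf_doc_vector(d) for d in docs])
```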
Section 3: Doc2Vec + Stop Word Removal (Untitled3.ipynb)
- Method: Doc2Vec (300 dimensions), sketched after this list
- Preprocessing: Stop word removal
- Model: Distributed Bag of Words (PV-DBOW, dm=0)
- Clustering: K-means (k=5)
- Features:
- Vector size: 300
- Window size: 5
- Minimum count: 1
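A minimal Doc2Vec sketch with the parameters above; in gensim, `dm=0` selects the PV-DBOW variant. `docs` is again a placeholder for the preprocessed corpus:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Placeholder tokenized, stop-word-filtered documents.
docs = [["نمونه", "متن"], ["سند", "دیگر"], ["خبر", "ورزشی"],
        ["گزارش", "اقتصادی"], ["مقاله", "هنری"]]

# Tag each document with its index so its vector can be looked up later.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=300, window=5, min_count=1,
                dm=0, workers=4)

# One 300-dimensional vector per document, clustered with k = 5.
X = [model.dv[i] for i in range(len(docs))]
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```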
Section 4: Latent Semantic Analysis (Untitled4.ipynb)
- Method: SVD-based dimensionality reduction (see the sketch after this list)
- Preprocessing: Stop word removal
- Vectorization: CountVectorizer
- Dimensionality: 300 components
- Clustering: K-means (k=5)
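The corresponding scikit-learn pipeline is short. This sketch assumes `puredInputSectionFour.txt` separates documents with the `'@@@@@@@@@@'` delimiter described under the preprocessing notes below; note that `TruncatedSVD` needs a vocabulary larger than 300 terms, which holds for the HAM corpus but not for toy inputs:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Load preprocessed documents (delimiter separation is an assumption here).
with open("puredInputSectionFour.txt", encoding="utf-8") as f:
    texts = [t for t in f.read().split("@@@@@@@@@@") if t.strip()]

counts = CountVectorizer().fit_transform(texts)            # term-count matrix
X = TruncatedSVD(n_components=300).fit_transform(counts)   # LSA: top 300 components
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```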
The project requires the following Python packages:

```
nltk
gensim
scikit-learn
numpy
pandas
scipy
```
Each section evaluates clustering performance using the following metrics (a short usage example follows the list):
- V-measure Score: Harmonic mean of homogeneity and completeness
- Cluster Distribution Analysis: Distribution of documents across clusters
- Class-wise Analysis: Performance per document category
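The V-measure comes from `sklearn.metrics`; the tiny example below is illustrative (cluster ids need not match class ids, only the grouping matters):

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

y_true = [0, 0, 1, 1, 2]   # gold category per document (illustrative)
y_pred = [1, 1, 0, 0, 2]   # K-means cluster id per document (illustrative)

print(homogeneity_score(y_true, y_pred))   # 1.0 -- each cluster holds one class
print(completeness_score(y_true, y_pred))  # 1.0 -- each class lands in one cluster
print(v_measure_score(y_true, y_pred))     # 1.0 -- harmonic mean of the two
```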
To run the project:

- Set up the environment:

  ```bash
  pip install nltk gensim scikit-learn numpy pandas scipy
  ```

- Download the NLTK tokenizer data:

  ```python
  import nltk
  nltk.download('punkt')
  ```

- Run the notebooks: open each Jupyter notebook in sequence and execute the cells for the respective section; results and visualizations are displayed inline.
Key observations:
- Word2Vec with stop word removal provides the baseline performance
- TF-IDF weighting improves document representation by considering term importance
- Doc2Vec captures document-level semantics effectively
- LSA offers dimensionality reduction while preserving semantic information
The project implements a comprehensive preprocessing pipeline (a minimal sketch follows the list):
- Tokenization: Using NLTK's word tokenizer
- Stop Word Removal: Using Persian-specific stop words
- Text Cleaning: Removing special characters and formatting
- Document Separation: Using '@@@@@@@@@@' delimiter
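A sketch of these steps; the file names match the repository, while the cleaning regex and the assumption that the raw training file uses the same delimiter are illustrative:

```python
import re
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

# Persian stop words shipped with the repository.
stop_words = set(open("Stop_words.txt", encoding="utf-8").read().split())

with open("Data/HAM-Train.txt", encoding="utf-8") as f:
    documents = f.read().split("@@@@@@@@@@")  # document separation (assumed here)

cleaned = []
for doc in documents:
    doc = re.sub(r"[^\w\s]", " ", doc)        # drop special characters/formatting
    tokens = [t for t in word_tokenize(doc) if t not in stop_words]
    cleaned.append(tokens)
```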
Notes:
- The project is designed for Persian text processing
- All notebooks include detailed comments and explanations
- Results are saved in separate text files for further analysis
- The implementation follows best practices for NLP document representation
This project is part of an academic assignment for an NLP course.
For detailed implementation and results, please refer to the individual Jupyter notebooks.