This project implements various Natural Language Processing (NLP) techniques for document representation and clustering using Persian text data. The project explores different approaches to convert documents into vector representations and evaluates their effectiveness in document clustering tasks.
This project is structured as a comprehensive NLP assignment that implements four different document representation methods:
- Section 1: Word2Vec-based document representation with stop word removal
- Section 2: TF-IDF weighted Word2Vec document representation
- Section 3: Doc2Vec document representation with stop word removal
- Section 4: Latent Semantic Analysis (LSA) using SVD
The repository is organized as follows:

```
nlp-document-representation/
├── Data/
│   ├── HAM-Train.txt              # Training dataset
│   └── HAM-Test.txt               # Test dataset
├── HW2_97131099/
│   ├── HW2_97131099.docx          # Assignment document
│   └── HW2_97131099.pdf           # Assignment PDF
├── Untitled1.ipynb                # Section 1: Word2Vec + Stop Words
├── Untitled2.ipynb                # Section 2: TF-IDF + Word2Vec
├── Untitled3.ipynb                # Section 3: Doc2Vec + Stop Words
├── Untitled4.ipynb                # Section 4: LSA with SVD
├── Stop_words.txt                 # Persian stop words (1299 words)
├── Stop_words2.txt                # Additional stop words (188 words)
├── puredInput.txt                 # Preprocessed input data
├── puredInputSectionOne.txt       # Section 1 preprocessed data
├── puredInputSectionTwo.txt       # Section 2 preprocessed data
├── puredInputSectionThree.txt     # Section 3 preprocessed data
├── puredInputSectionFour.txt      # Section 4 preprocessed data
└── README.md                      # This file
```
The project uses the HAM (Hamshahri) dataset, which contains Persian news articles categorized into 5 classes:
- اقتصاد (Economy)
- سیاسی (Political)
- اجتماعی (Social)
- ادب و هنر (Literature and Art)
- ورزش (Sports)
- Training Set: 7,740 documents
- Test Set: 1,000 documents
- Total: 8,740 documents
- Classes: 5 categories
Section 1: Word2Vec + Stop Word Removal (Untitled1.ipynb)
- Method: Word2Vec (300 dimensions) + document averaging (a minimal sketch follows this list)
- Preprocessing: Stop word removal using Persian stop words
- Clustering: K-means (k=5)
- Features:
- Window size: 5
- Minimum count: 1
- Workers: 4
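A minimal sketch of this pipeline, assuming gensim ≥ 4 (`vector_size` keyword) and scikit-learn; `docs` is a hypothetical stand-in for the tokenized, stop-word-filtered corpus from `puredInputSectionOne.txt`, not the notebook's actual variable:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Placeholder for the tokenized, stop-word-filtered documents.
docs = [["نمونه", "متن"], ["سند", "دیگر"], ["خبر", "ورزشی"],
        ["گزارش", "اقتصادی"], ["مقاله", "هنری"]]

# Train Word2Vec with the parameters listed above.
model = Word2Vec(docs, vector_size=300, window=5, min_count=1, workers=4)

# Each document becomes the average of its word vectors.
def doc_vector(tokens):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X = np.vstack([doc_vector(d) for d in docs])

# Cluster the document vectors into the 5 HAM categories.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```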
Section 2: TF-IDF Weighted Word2Vec (Untitled2.ipynb)
- Method: Word2Vec + TF-IDF weighting (sketched after this list)
- Preprocessing: No stop word removal
- Weighting: TF-IDF scores for each word
- Clustering: K-means (k=5)
- Features:
- TF calculation: `1 + log10(term_frequency)`
- IDF calculation: `log10(total_documents / document_frequency)`
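A hedged sketch of the weighting step using the TF/IDF definitions above; the helper names are illustrative, not the notebook's identifiers. Each word vector is scaled by its TF-IDF weight, and the document vector is the weight-normalized sum:

```python
import math
from collections import Counter
import numpy as np
from gensim.models import Word2Vec

# Placeholder tokenized documents (no stop word removal in this section).
docs = [["نمونه", "متن"], ["سند", "دیگر", "متن"], ["خبر", "ورزشی"],
        ["گزارش", "اقتصادی"], ["مقاله", "هنری"]]

model = Word2Vec(docs, vector_size=300, window=5, min_count=1, workers=4)

N = len(docs)
df = Counter(w for d in docs for w in set(d))  # document frequency per word

def tfidf_doc_vector(tokens):
    tf = Counter(tokens)
    vec = np.zeros(model.wv.vector_size)
    weight_sum = 0.0
    for word, freq in tf.items():
        # TF = 1 + log10(term_frequency); IDF = log10(N / document_frequency)
        weight = (1 + math.log10(freq)) * math.log10(N / df[word])
        vec += weight * model.wv[word]   # min_count=1, so every token is in vocab
        weight_sum += weight
    return vec / weight_sum if weight_sum else vec

X = np.vstack([tfidf_doc_vector(d) for d in docs])
```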
Section 3: Doc2Vec + Stop Word Removal (Untitled3.ipynb)
- Method: Doc2Vec (300 dimensions), sketched after this list
- Preprocessing: Stop word removal
- Model: Distributed Bag of Words (PV-DBOW, dm=0)
- Clustering: K-means (k=5)
- Features:
- Vector size: 300
- Window size: 5
- Minimum count: 1
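A minimal Doc2Vec sketch with the parameters above; in gensim, `dm=0` selects the PV-DBOW variant. `docs` is again a placeholder for the preprocessed corpus:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Placeholder tokenized, stop-word-filtered documents.
docs = [["نمونه", "متن"], ["سند", "دیگر"], ["خبر", "ورزشی"],
        ["گزارش", "اقتصادی"], ["مقاله", "هنری"]]

# Tag each document with its index so its vector can be looked up later.
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=300, window=5, min_count=1,
                dm=0, workers=4)

# One 300-dimensional vector per document, clustered with k = 5.
X = [model.dv[i] for i in range(len(docs))]
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```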
Section 4: Latent Semantic Analysis (Untitled4.ipynb)
- Method: SVD-based dimensionality reduction (see the sketch after this list)
- Preprocessing: Stop word removal
- Vectorization: CountVectorizer
- Dimensionality: 300 components
- Clustering: K-means (k=5)
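The corresponding scikit-learn pipeline is short. This sketch assumes `puredInputSectionFour.txt` separates documents with the `'@@@@@@@@@@'` delimiter described under the preprocessing notes below; note that `TruncatedSVD` needs a vocabulary larger than 300 terms, which holds for the HAM corpus but not for toy inputs:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Load preprocessed documents (delimiter separation is an assumption here).
with open("puredInputSectionFour.txt", encoding="utf-8") as f:
    texts = [t for t in f.read().split("@@@@@@@@@@") if t.strip()]

counts = CountVectorizer().fit_transform(texts)            # term-count matrix
X = TruncatedSVD(n_components=300).fit_transform(counts)   # LSA: top 300 components
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```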
The project requires the following Python packages:

```
nltk
gensim
scikit-learn
numpy
pandas
scipy
```
Each section evaluates clustering performance using the following metrics (a short usage example follows the list):
- V-measure Score: Harmonic mean of homogeneity and completeness
- Cluster Distribution Analysis: Distribution of documents across clusters
- Class-wise Analysis: Performance per document category
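The V-measure comes from `sklearn.metrics`; the tiny example below is illustrative (cluster ids need not match class ids, only the grouping matters):

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

y_true = [0, 0, 1, 1, 2]   # gold category per document (illustrative)
y_pred = [1, 1, 0, 0, 2]   # K-means cluster id per document (illustrative)

print(homogeneity_score(y_true, y_pred))   # 1.0 -- each cluster holds one class
print(completeness_score(y_true, y_pred))  # 1.0 -- each class lands in one cluster
print(v_measure_score(y_true, y_pred))     # 1.0 -- harmonic mean of the two
```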
To run the project:

- Set up the environment:

  ```bash
  pip install nltk gensim scikit-learn numpy pandas scipy
  ```

- Download the NLTK tokenizer data:

  ```python
  import nltk
  nltk.download('punkt')
  ```

- Run the notebooks: open each Jupyter notebook in sequence and execute the cells for the respective section; results and visualizations are displayed inline.
Key observations:
- Word2Vec with stop word removal provides the baseline performance
- TF-IDF weighting improves document representation by considering term importance
- Doc2Vec captures document-level semantics effectively
- LSA offers dimensionality reduction while preserving semantic information
The project implements a comprehensive preprocessing pipeline (a minimal sketch follows the list):
- Tokenization: Using NLTK's word tokenizer
- Stop Word Removal: Using Persian-specific stop words
- Text Cleaning: Removing special characters and formatting
- Document Separation: Using '@@@@@@@@@@' delimiter
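A sketch of these steps; the file names match the repository, while the cleaning regex and the assumption that the raw training file uses the same delimiter are illustrative:

```python
import re
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

# Persian stop words shipped with the repository.
stop_words = set(open("Stop_words.txt", encoding="utf-8").read().split())

with open("Data/HAM-Train.txt", encoding="utf-8") as f:
    documents = f.read().split("@@@@@@@@@@")  # document separation (assumed here)

cleaned = []
for doc in documents:
    doc = re.sub(r"[^\w\s]", " ", doc)        # drop special characters/formatting
    tokens = [t for t in word_tokenize(doc) if t not in stop_words]
    cleaned.append(tokens)
```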
Notes:
- The project is designed for Persian text processing
- All notebooks include detailed comments and explanations
- Results are saved in separate text files for further analysis
- The implementation follows best practices for NLP document representation
This project is part of an academic assignment for an NLP course.
For detailed implementation and results, please refer to the individual Jupyter notebooks.