A comprehensive Persian Natural Language Processing pipeline that implements Part-of-Speech (POS) tagging and Named Entity Recognition (NER) using machine learning approaches. This project processes Persian text with UTF-8 encoding and provides accurate linguistic analysis for Persian language applications.
- Advanced MLP Classifier: Multilayer Perceptron neural network for accurate POS prediction
- Contextual Feature Engineering:
- Word position analysis (first/last in sentence)
- Previous and next word context
- Morphological features (hyphenation, numeric detection)
- Word-level characteristics
- Persian Language Support: Optimized for Persian text processing
- High Accuracy: Robust performance on Persian text corpora
- Stanford NLP Integration: Leverages Stanford NLP toolkit for NER
- Custom Persian Model: Pre-trained model specifically for Persian entities
- Entity Types: Person names, locations, organizations, and other named entities
- Comprehensive Evaluation: Entity-level precision, recall, and F1-score metrics
- UTF-8 Encoding: Full support for Persian Unicode characters
- Automated Preprocessing: Sentence boundary detection and word-tag separation
- Feature Extraction: Advanced contextual feature generation
- Model Persistence: Save and load trained models for deployment
- Python 3.7 or higher
- Java Runtime Environment (JRE) 1.8+ (for Stanford NER)
- Git
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/nlp-postagger-ner.git
  cd nlp-postagger-ner
  ```

- Install Python dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Download NLTK data

  ```python
  import nltk
  nltk.download('punkt')
  nltk.download('averaged_perceptron_tagger')
  ```

- Set up the Java environment (for Stanford NER)
  - Ensure Java is installed and `JAVA_HOME` is set
  - The project includes `stanford-ner.jar` and `trained_model.ser.gz`
```
nlp-postagger-ner/
├── Data/                    # Persian text corpora
│   ├── POStrutf.txt         # POS training data (UTF-8)
│   ├── POSteutf.txt         # POS test data (UTF-8)
│   ├── NERtr.txt            # NER training data
│   ├── NERte.txt            # NER test data
│   ├── in.txt               # Sample input
│   └── out.txt              # Sample output
├── Section1_POS.ipynb       # POS tagging implementation
├── Section2_NER.ipynb       # NER implementation
├── NNModel.joblib           # Serialized POS model
├── trained_model.ser.gz     # Stanford NER model
├── stanford-ner.jar         # Stanford NER JAR file
├── Report.pdf               # Detailed project report
├── SNLP_HW3.pdf             # Assignment specification
└── README.md                # This file
```
- Open the POS notebook

  ```bash
  jupyter notebook Section1_POS.ipynb
  ```

- Run the cells sequentially to (a sketch of the evaluation and saving steps follows this list):
- Load and preprocess Persian text data
- Extract contextual features
- Train the MLP classifier
- Evaluate model performance
- Save the trained model
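As a condensed illustration of the last two steps only (not the notebook's exact code; `pos_pipeline`, `X_test`, and `y_test` are assumed to come from the earlier cells), evaluation and saving might look like:

```python
# Hedged sketch: evaluate the trained classifier and persist it with joblib.
# `pos_pipeline`, `X_test`, and `y_test` are assumed to exist from earlier cells.
from joblib import dump
from sklearn.metrics import accuracy_score, classification_report

y_pred = pos_pipeline.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

dump(pos_pipeline, 'NNModel.joblib')  # corresponds to the NNModel.joblib file in this repo
```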
- Open the NER notebook

  ```bash
  jupyter notebook Section2_NER.ipynb
  ```

- Execute the cells to:
- Load the pre-trained Stanford NER model
- Process test data
- Perform entity recognition
- Calculate evaluation metrics
```python
# Load the serialized POS model
from joblib import load
pos_model = load('NNModel.joblib')

# Load the Stanford NER model (requires Java on the PATH)
from nltk.tag.stanford import StanfordNERTagger
ner_tagger = StanfordNERTagger('trained_model.ser.gz', 'stanford-ner.jar', encoding='utf8')

# Process Persian text ("Persian text for processing")
text = "متن فارسی برای پردازش"

# Apply NER; POS tagging additionally requires the same feature extraction used at training time
print(ner_tagger.tag(text.split()))
```
- Python 3.7+: Core implementation language
- scikit-learn: Machine learning pipeline and MLPClassifier
- NLTK: Natural language processing toolkit
- Stanford NLP: Named entity recognition framework
- pandas & numpy: Data manipulation and numerical operations
- joblib: Model serialization and persistence
- Jupyter Notebooks: Interactive development environment
The POS tagger extracts the following contextual features for each token:

```python
def features(sentence, index):
    """Contextual features for the word at `index` in a tokenized sentence."""
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
    }
```
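These per-token feature dictionaries can be fed to a scikit-learn model. The sketch below is illustrative rather than the notebook's exact code: the `DictVectorizer` step, the hyperparameters, and the `train_sentences` variable (a list of sentences as `(word, tag)` pairs) are assumptions.

```python
# Illustrative sketch: feature dicts -> DictVectorizer -> MLPClassifier.
# Hyperparameters and `train_sentences` (list of [(word, tag), ...]) are assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

def to_dataset(tagged_sentences):
    """Flatten tagged sentences into per-token feature dicts and label lists."""
    X, y = [], []
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        for index, (_, tag) in enumerate(sentence):
            X.append(features(words, index))  # features() as defined above
            y.append(tag)
    return X, y

X_train, y_train = to_dataset(train_sentences)

pos_pipeline = Pipeline([
    ('vectorizer', DictVectorizer(sparse=True)),   # one-hot encodes the feature dicts
    ('classifier', MLPClassifier(hidden_layer_sizes=(100,), max_iter=50)),
])
pos_pipeline.fit(X_train, y_train)
```

Tagging a new sentence then runs the same `features()` extraction per token before calling `pos_pipeline.predict`.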
- Training Data: Tab-separated word-tag pairs
- Encoding: UTF-8 for Persian character support
- Sentence Boundaries: Marked with special tokens
- Entity Labels: BIO tagging scheme for NER
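Given the format described above, one way to read such a file into tagged sentences might look like the following; the boundary marker and column order here are assumptions, not verified against the corpus.

```python
# Hedged sketch: read a UTF-8, tab-separated word<TAB>tag corpus into sentences.
# BOUNDARY_TOKEN is a placeholder; substitute the corpus's actual sentence-boundary marker.
BOUNDARY_TOKEN = '.'

def load_tagged_corpus(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue
            word, tag = parts[0], parts[1]
            current.append((word, tag))
            if word == BOUNDARY_TOKEN:   # assumed end-of-sentence marker
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

train_sentences = load_tagged_corpus('Data/POStrutf.txt')
```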
- Accuracy: Overall tagging accuracy on the Persian test set
- Confusion Matrix: Detailed analysis of tag predictions
- Cross-validation: Robust model evaluation
- Precision: Entity-level precision
- Recall: Entity-level recall
- F1-Score: Harmonic mean of entity-level precision and recall
- Entity Types: Per-category results for the supported entity types
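For reference, entity-level precision, recall, and F1 reduce to counting matches between predicted and gold entity spans. A minimal sketch, assuming entities have already been collected as `(start, end, entity_type)` tuples:

```python
# Minimal sketch of entity-level precision/recall/F1 over span sets.
# Assumes gold and predicted entities are sets of (start, end, entity_type) tuples.
def entity_prf(gold_entities, predicted_entities):
    true_positives = len(gold_entities & predicted_entities)
    precision = true_positives / len(predicted_entities) if predicted_entities else 0.0
    recall = true_positives / len(gold_entities) if gold_entities else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example with hypothetical spans
precision, recall, f1 = entity_prf(
    {(0, 2, 'PERSON'), (5, 6, 'LOCATION')},       # gold
    {(0, 2, 'PERSON'), (7, 8, 'ORGANIZATION')},   # predicted
)
```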
This project is licensed under the MIT License - see the LICENSE file for details.