A comprehensive Persian Natural Language Processing pipeline that implements Part-of-Speech (POS) tagging and Named Entity Recognition (NER) using machine learning approaches. This project processes Persian text with UTF-8 encoding and provides accurate linguistic analysis for Persian language applications.
- Advanced MLP Classifier: Multilayer Perceptron neural network for accurate POS prediction
- Contextual Feature Engineering:
- Word position analysis (first/last in sentence)
- Previous and next word context
- Morphological features (hyphenation, numeric detection)
- Word-level characteristics
- Persian Language Support: Optimized for Persian text processing
- High Accuracy: Robust performance on Persian text corpora
- Stanford NLP Integration: Leverages Stanford NLP toolkit for NER
- Custom Persian Model: Pre-trained model specifically for Persian entities
- Entity Types: Person names, locations, organizations, and other named entities
- Comprehensive Evaluation: Entity-level precision, recall, and F1-score metrics
- UTF-8 Encoding: Full support for Persian Unicode characters
- Automated Preprocessing: Sentence boundary detection and word-tag separation
- Feature Extraction: Advanced contextual feature generation
- Model Persistence: Save and load trained models for deployment
- Python 3.7 or higher
- Java Runtime Environment (JRE) 1.8+ (for Stanford NER)
- Git
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/nlp-postagger-ner.git
  cd nlp-postagger-ner
  ```

- Install Python dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Download NLTK data

  ```python
  import nltk
  nltk.download('punkt')
  nltk.download('averaged_perceptron_tagger')
  ```

- Set up the Java environment (for Stanford NER)
  - Ensure Java is installed and `JAVA_HOME` is set
  - The project includes `stanford-ner.jar` and `trained_model.ser.gz`
```
nlp-postagger-ner/
├── Data/                    # Persian text corpora
│   ├── POStrutf.txt         # POS training data (UTF-8)
│   ├── POSteutf.txt         # POS test data (UTF-8)
│   ├── NERtr.txt            # NER training data
│   ├── NERte.txt            # NER test data
│   ├── in.txt               # Sample input
│   └── out.txt              # Sample output
├── Section1_POS.ipynb       # POS tagging implementation
├── Section2_NER.ipynb       # NER implementation
├── NNModel.joblib           # Serialized POS model
├── trained_model.ser.gz     # Stanford NER model
├── stanford-ner.jar         # Stanford NER JAR file
├── Report.pdf               # Detailed project report
├── SNLP_HW3.pdf             # Assignment specification
└── README.md                # This file
```
- Open the POS notebook

  ```bash
  jupyter notebook Section1_POS.ipynb
  ```

- Run the cells sequentially to (a sketch of the evaluation and saving steps follows this list):
- Load and preprocess Persian text data
- Extract contextual features
- Train the MLP classifier
- Evaluate model performance
- Save the trained model
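As a condensed illustration of the last two steps only (not the notebook's exact code; `pos_pipeline`, `X_test`, and `y_test` are assumed to come from the earlier cells), evaluation and saving might look like:

```python
# Hedged sketch: evaluate the trained classifier and persist it with joblib.
# `pos_pipeline`, `X_test`, and `y_test` are assumed to exist from earlier cells.
from joblib import dump
from sklearn.metrics import accuracy_score, classification_report

y_pred = pos_pipeline.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

dump(pos_pipeline, 'NNModel.joblib')  # corresponds to the NNModel.joblib file in this repo
```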
- Open the NER notebook

  ```bash
  jupyter notebook Section2_NER.ipynb
  ```

- Execute the cells to:
- Load the pre-trained Stanford NER model
- Process test data
- Perform entity recognition
- Calculate evaluation metrics
```python
# Load the serialized POS model
from joblib import load
pos_model = load('NNModel.joblib')

# Load the Stanford NER model (requires Java on the PATH)
from nltk.tag.stanford import StanfordNERTagger
ner_tagger = StanfordNERTagger('trained_model.ser.gz', 'stanford-ner.jar', encoding='utf8')

# Process Persian text ("Persian text for processing")
text = "متن فارسی برای پردازش"

# Apply NER; POS tagging additionally requires the same feature extraction used at training time
print(ner_tagger.tag(text.split()))
```
- Python 3.7+: Core implementation language
- scikit-learn: Machine learning pipeline and MLPClassifier
- NLTK: Natural language processing toolkit
- Stanford NLP: Named entity recognition framework
- pandas & numpy: Data manipulation and numerical operations
- joblib: Model serialization and persistence
- Jupyter Notebooks: Interactive development environment
The POS tagger extracts the following contextual features for each token:

```python
def features(sentence, index):
    """Contextual features for the word at `index` in a tokenized sentence."""
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
    }
```
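These per-token feature dictionaries can be fed to a scikit-learn model. The sketch below is illustrative rather than the notebook's exact code: the `DictVectorizer` step, the hyperparameters, and the `train_sentences` variable (a list of sentences as `(word, tag)` pairs) are assumptions.

```python
# Illustrative sketch: feature dicts -> DictVectorizer -> MLPClassifier.
# Hyperparameters and `train_sentences` (list of [(word, tag), ...]) are assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

def to_dataset(tagged_sentences):
    """Flatten tagged sentences into per-token feature dicts and label lists."""
    X, y = [], []
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        for index, (_, tag) in enumerate(sentence):
            X.append(features(words, index))  # features() as defined above
            y.append(tag)
    return X, y

X_train, y_train = to_dataset(train_sentences)

pos_pipeline = Pipeline([
    ('vectorizer', DictVectorizer(sparse=True)),   # one-hot encodes the feature dicts
    ('classifier', MLPClassifier(hidden_layer_sizes=(100,), max_iter=50)),
])
pos_pipeline.fit(X_train, y_train)
```

Tagging a new sentence then runs the same `features()` extraction per token before calling `pos_pipeline.predict`.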
- Training Data: Tab-separated word-tag pairs
- Encoding: UTF-8 for Persian character support
- Sentence Boundaries: Marked with special tokens
- Entity Labels: BIO tagging scheme for NER
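Given the format described above, one way to read such a file into tagged sentences might look like the following; the boundary marker and column order here are assumptions, not verified against the corpus.

```python
# Hedged sketch: read a UTF-8, tab-separated word<TAB>tag corpus into sentences.
# BOUNDARY_TOKEN is a placeholder; substitute the corpus's actual sentence-boundary marker.
BOUNDARY_TOKEN = '.'

def load_tagged_corpus(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 2:
                continue
            word, tag = parts[0], parts[1]
            current.append((word, tag))
            if word == BOUNDARY_TOKEN:   # assumed end-of-sentence marker
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

train_sentences = load_tagged_corpus('Data/POStrutf.txt')
```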
- Accuracy: Overall tagging accuracy on the Persian test set
- Confusion Matrix: Detailed analysis of tag predictions
- Cross-validation: Robust model evaluation
- Precision: Entity-level precision
- Recall: Entity-level recall
- F1-Score: Harmonic mean of entity-level precision and recall
- Entity Types: Per-category results for the supported entity types
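For reference, entity-level precision, recall, and F1 reduce to counting matches between predicted and gold entity spans. A minimal sketch, assuming entities have already been collected as `(start, end, entity_type)` tuples:

```python
# Minimal sketch of entity-level precision/recall/F1 over span sets.
# Assumes gold and predicted entities are sets of (start, end, entity_type) tuples.
def entity_prf(gold_entities, predicted_entities):
    true_positives = len(gold_entities & predicted_entities)
    precision = true_positives / len(predicted_entities) if predicted_entities else 0.0
    recall = true_positives / len(gold_entities) if gold_entities else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example with hypothetical spans
precision, recall, f1 = entity_prf(
    {(0, 2, 'PERSON'), (5, 6, 'LOCATION')},       # gold
    {(0, 2, 'PERSON'), (7, 8, 'ORGANIZATION')},   # predicted
)
```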
This project is licensed under the MIT License - see the LICENSE file for details.