NLP Keyword Extraction Project

A Persian/Farsi text analysis tool that extracts keywords and keyphrases from text documents using the TextRank algorithm with POS tagging and natural language processing techniques.

🎯 Project Overview

This project implements a keyword extraction system for Persian/Farsi text. It combines the TextRank algorithm with Part-of-Speech (POS) tagging: POS tags decide which words are candidates, and TextRank ranks those candidates to identify the most important words and phrases in each document.

✨ Features

Persian/Farsi Text Processing: Optimized for Persian language with proper text normalization
Keyword Extraction: Extracts individual keywords with their importance scores
Keyphrase Extraction: Identifies meaningful multi-word phrases
POS Tagging: Uses Hazm library for accurate Persian POS tagging
TextRank Algorithm: Implements the TextRank algorithm for ranking words and phrases
Stop Word Filtering: Comprehensive Persian stop word list for better results
Batch Processing: Processes multiple text files automatically
Configurable Parameters: Adjustable window size, damping coefficient, and convergence threshold

🏗️ Project Structure

nlp-keyword-extraction/
├── Input/                          # Input text files
│   ├── wiki_fa_1_NeuralNetwork.txt
│   ├── wiki_fa_2_DataBase.txt
│   └── wiki_fa_3_NLP.txt
├── Output/                         # Generated keyword and phrase files
│   ├── WordOut_*.txt              # Individual keywords with scores
│   └── PhraseOut_*.txt            # Keyphrases with scores
├── resources/                      # NLP models and libraries
│   ├── postagger.model            # POS tagging model
│   ├── chunker.model              # Chunking model
│   ├── langModel.mco              # Language model
│   ├── malt.jar                   # MaltParser
│   └── lib/                       # Java libraries
├── source.py                      # Main application code
├── Stop_words.txt                 # Persian stop words list
└── README.md                      # This file

🚀 Installation

Prerequisites

Python 3.6 or higher
Java Runtime Environment (JRE) for MaltParser

Dependencies

Install the required Python packages:

pip install hazm numpy

Setup

Clone or download this repository
Ensure all resource files are in the resources/ directory
Make sure Java is installed and accessible from command line
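
The setup steps above can be checked with a short script. This is a minimal sketch, not part of the project: the function names are hypothetical, and the file names are taken from the project structure listed earlier.

```python
import shutil
from pathlib import Path

# Resource names as listed in the project structure section of this README.
REQUIRED_RESOURCES = ["postagger.model", "chunker.model", "langModel.mco", "malt.jar"]


def missing_resources(resource_dir="resources"):
    """Return the required resource files that are not present in resource_dir."""
    base = Path(resource_dir)
    return [name for name in REQUIRED_RESOURCES if not (base / name).is_file()]


def java_available():
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing resource files:", ", ".join(missing))
    if not java_available():
        print("Java was not found on PATH.")
    if not missing and java_available():
        print("Setup looks complete.")
```

Running it from the repository root before `python source.py` should surface a missing model file or a missing Java installation early.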

📖 Usage

Basic Usage

Place your Persian text files in the Input/ directory
Run the main script:

python source.py

Check the Output/ directory for results:
- WordOut_[filename].txt: Contains individual keywords with scores
- PhraseOut_[filename].txt: Contains keyphrases with scores
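
The mapping from input files to output files can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, and it assumes `[filename]` means the input name without its `.txt` extension, which this README does not state explicitly.

```python
from pathlib import Path


def output_paths(input_file, output_dir="Output"):
    """Map one input file to its keyword and keyphrase output paths.

    Assumption: [filename] is the input file name without the extension.
    """
    stem = Path(input_file).stem
    out = Path(output_dir)
    return out / f"WordOut_{stem}.txt", out / f"PhraseOut_{stem}.txt"


def plan_batch(input_dir="Input", output_dir="Output"):
    """List (input, word_output, phrase_output) for every .txt file in input_dir."""
    return [
        (path, *output_paths(path, output_dir))
        for path in sorted(Path(input_dir).glob("*.txt"))
    ]


if __name__ == "__main__":
    for source, words, phrases in plan_batch():
        print(source, "->", words, "and", phrases)
```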

Output Format

Keywords Output:

عصبی - 2.9834683173198893
نورون ها - 2.4284856373486843
نورون - 2.3584436738484387
...

Keyphrases Output:

 نورون ها تشکیل - 2.3479472902508887
 شبکهٔ عصبی - 2.2743010535260852
 نورون های لایه های - 1.949765382008226
...
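
To consume these files from another script, each line can be split at its last ` - ` separator. The following is a minimal sketch based only on the sample lines shown above; the function names are hypothetical and the files are assumed to be UTF-8 encoded.

```python
def parse_scored_line(line):
    """Split 'term - score' into (term, score); return None for other lines."""
    term, sep, score = line.strip().rpartition(" - ")
    if not sep or not term:
        return None
    try:
        return term, float(score)
    except ValueError:
        return None


def load_scores(path):
    """Read an output file and return a list of (term, score) pairs."""
    with open(path, encoding="utf-8") as handle:
        parsed = (parse_scored_line(line) for line in handle)
        return [item for item in parsed if item is not None]
```

Splitting at the last separator keeps multi-word phrases intact, and stripping the line first handles the leading space seen in the keyphrase output.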

🔧 Configuration

The TextRank algorithm parameters can be adjusted in the TextRank4Keyword class:

d: Damping coefficient (default: 0.85)
min_diff: Convergence threshold (default: 1e-5)
steps: Maximum iteration steps (default: 10)
Window size: how many consecutive tokens are considered when pairing co-occurring words (default: 4)
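
To show how `d`, `min_diff`, and `steps` interact, here is a minimal pure-Python sketch of a PageRank-style iteration. It is not the code in `source.py`: the function name `rank_nodes`, the graph representation, and the convergence test (largest per-node change) are assumptions made for illustration.

```python
def rank_nodes(weights, d=0.85, min_diff=1e-5, steps=10):
    """Score nodes of an undirected weighted graph.

    `weights` maps node -> {neighbor: edge weight}; every edge must appear
    under both of its endpoints. Each pass applies
    score(v) = (1 - d) + d * sum(w(u, v) / total_weight(u) * score(u)).
    """
    if not weights:
        return {}
    totals = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    scores = {node: 1.0 for node in weights}
    for _ in range(steps):  # `steps` caps the number of passes
        updated = {
            node: (1 - d) + d * sum(
                w / totals[nbr] * scores[nbr] for nbr, w in nbrs.items()
            )
            for node, nbrs in weights.items()
        }
        change = max(abs(updated[node] - scores[node]) for node in scores)
        scores = updated
        if change < min_diff:  # stop early once scores have settled
            break
    return scores


if __name__ == "__main__":
    star = {"a": {"b": 1.0, "c": 1.0}, "b": {"a": 1.0}, "c": {"a": 1.0}}
    print(rank_nodes(star))
```

In this sketch, a higher `d` gives more weight to neighbors' scores, while a smaller `min_diff` or a larger `steps` lets the iteration run longer before stopping.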

🧠 Algorithm Details

TextRank Implementation

The project implements the TextRank algorithm with the following steps:

Text Preprocessing: Normalization and sentence tokenization
POS Tagging: Using Hazm library for Persian POS tagging
Candidate Selection: Filtering words based on POS tags (N, V, Ne, AJ, AJe)
Graph Construction: Building co-occurrence graph with configurable window size
Ranking: Applying PageRank algorithm to rank words and phrases
Scoring: Calculating importance scores for keywords and keyphrases
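
The graph construction step can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: the function name is hypothetical, the input is assumed to be sentences that have already been reduced to candidate words, and two words are treated as co-occurring when fewer than `window_size` positions separate them.

```python
from collections import defaultdict


def cooccurrence_weights(sentences, window_size=4):
    """Count co-occurrences of distinct words inside a sliding window.

    Returns node -> {neighbor: count}, with each edge stored in both directions.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for words in sentences:
        for i, word in enumerate(words):
            # Pair the word with the next window_size - 1 tokens.
            for other in words[i + 1 : i + window_size]:
                if other != word:
                    weights[word][other] += 1.0
                    weights[other][word] += 1.0
    return {word: dict(nbrs) for word, nbrs in weights.items()}


if __name__ == "__main__":
    sample = [["شبکه", "عصبی", "نورون", "لایه", "ورودی"]]
    for word, neighbors in cooccurrence_weights(sample).items():
        print(word, sorted(neighbors))
```

A larger window links words that are further apart, which produces a denser graph and tends to flatten the differences between scores.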

Supported POS Tags

N: Nouns
V: Verbs
Ne: Nouns with ezafe
AJ: Adjectives
AJe: Adjectives with ezafe
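
Candidate selection by tag can be sketched as a simple filter over (word, tag) pairs. The example below does not call Hazm; the tagged sentence is written by hand for illustration, and the function name and the stop word set are hypothetical.

```python
# Tags kept as keyword candidates, as listed above.
KEPT_TAGS = {"N", "V", "Ne", "AJ", "AJe"}


def select_candidates(tagged_sentence, stop_words=frozenset()):
    """Keep words whose tag is in KEPT_TAGS and that are not stop words."""
    return [
        word
        for word, tag in tagged_sentence
        if tag in KEPT_TAGS and word not in stop_words
    ]


if __name__ == "__main__":
    # Hand-assigned tags, for illustration only.
    tagged = [("شبکهٔ", "Ne"), ("عصبی", "AJ"), ("از", "P"), ("نورون", "N"), ("است", "V")]
    print(select_candidates(tagged, stop_words={"است"}))
```

In practice the stop word set would be loaded from Stop_words.txt and the pairs would come from the POS tagger.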

📊 Example Results

For a text about neural networks, the system extracts:

Top Keywords:

عصبی (Neural) - 2.98
نورون ها (Neurons) - 2.43
نورون (Neuron) - 2.36

Top Keyphrases:

نورون ها تشکیل (Neurons formation) - 2.35
شبکهٔ عصبی (Neural network) - 2.27
نورون های لایه های (Neurons of layers) - 1.95

🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for:

Bug fixes
Performance improvements
Additional language support
New features

📝 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

Hazm - Persian NLP library
TextRank - Original algorithm paper
MaltParser - For dependency parsing capabilities

📞 Support

If you encounter any issues or have questions, please:

Check the existing issues in the repository
Create a new issue with detailed description
Include sample input and expected output

Note: This project is specifically optimized for Persian/Farsi text processing and may require adjustments for other languages.

Repository: ghorbani-mohammad/nlp-keyword-extraction

Folders and files

Latest commit

History

Repository files navigation

NLP Keyword Extraction Project

🎯 Project Overview

✨ Features

🏗️ Project Structure

🚀 Installation

Prerequisites

Dependencies

Setup

📖 Usage

Basic Usage

Output Format

🔧 Configuration

🧠 Algorithm Details

TextRank Implementation

Supported POS Tags

📊 Example Results

🤝 Contributing

📝 License

🙏 Acknowledgments

📞 Support

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Languages