Text clustering is a common preprocessing step for analyzing text features. This project implements a memory-friendly method intended only for clustering short texts. For long texts, SimHash, LDA, or another method is preferable, depending on your needs.
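To make the idea of memory-friendly short-text clustering concrete, here is a minimal sketch. It is not taken from cluster.py and may differ from what this project actually does. It assumes whitespace tokenization, Jaccard similarity on token sets, and a single greedy pass that keeps only a bounded number of representative token sets per cluster. The names jaccard, assign_clusters, threshold and max_reps are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign_clusters(lines, threshold=0.5, max_reps=5):
    """Greedy single-pass clustering (illustrative assumption, not the project's code).

    Each cluster keeps at most `max_reps` token sets as representatives,
    so memory grows with the number of clusters, not the number of lines.
    Returns one cluster id per input line.
    """
    reps = []    # reps[i] holds the representative token sets of cluster i
    labels = []
    for line in lines:
        tokens = set(line.lower().split())
        for cid, cluster_reps in enumerate(reps):
            # Join the first cluster with a sufficiently similar representative.
            if any(jaccard(tokens, r) >= threshold for r in cluster_reps):
                if len(cluster_reps) < max_reps:
                    cluster_reps.append(tokens)
                labels.append(cid)
                break
        else:
            # No cluster matched: start a new one.
            reps.append([tokens])
            labels.append(len(reps) - 1)
    return labels


example = [
    "how to reset my password",
    "how to reset password",
    "track my order status",
    "track order status",
]
labels = assign_clusters(example)
```

In this sketch the first two lines share one cluster id and the last two share another. A greedy pass like this depends on input order, which is the trade-off for not holding all pairwise similarities in memory.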
Requirements
pip install tqdm spacy
Usage
Clustering
python cluster.py --infile ./data/infile_en \
--output ./data/output \
--lang en
For descriptions of the other configuration arguments, such as the stop words setting and the sample number, see _get_parser() in cluster.py.
Search
Basic Idea
File Structure
TextCluster
|   README.md
|   LICENSE
|   cluster.py              clustering function
|   search.py               search function
|
|------utils                utilities
|   |   __init__.py
|   |   segmentor.py        tokenizer wrapper
|   |   similar.py          similarity calculator
|   |   utils.py            file processing module
|
|------data
|   |   infile              default input file, used to test Chinese mode
|   |   infile_en           default input file, used to test English mode
|   |   seg_dict            default tokenizer dictionary
|   |   stop_words          default stop words file
Other Languages
To support another language, modify the tokenizer wrapper in ./utils/segmentor.py.
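As a rough illustration of what a tokenizer wrapper for another language could look like, the sketch below splits text on word characters and filters stop words. The class name SimpleSegmentor and the method cut are hypothetical and are not taken from segmentor.py. Check the interface that cluster.py actually expects before adapting it.

```python
import re


class SimpleSegmentor:
    """Hypothetical tokenizer wrapper; the real interface in segmentor.py may differ."""

    # \w+ matches runs of Unicode word characters in Python 3.
    _token_re = re.compile(r"\w+")

    def __init__(self, stop_words=None):
        # Stop words are lowercased so that matching is case-insensitive.
        self.stop_words = {w.lower() for w in (stop_words or [])}

    def cut(self, text):
        """Return the lowercase tokens of `text` with stop words removed."""
        tokens = self._token_re.findall(text.lower())
        return [t for t in tokens if t not in self.stop_words]


segmentor = SimpleSegmentor(stop_words=["le", "la"])
tokens = segmentor.cut("Le chat mange la souris.")
```

A regex split like this only suits languages that separate words with spaces or punctuation. Languages without explicit word boundaries need a dedicated segmenter.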