Python | Extract words from given string
Last Updated: 11 Jul, 2025
Extracting words from a given string refers to identifying and separating individual words from a block of text or a sentence. This is a common task when processing, searching, filtering or analyzing text.
Example: Here, each word is extracted from a given string.
Input: GeeksForGeeks is the best Computer Science Portal
Output: ['GeeksForGeeks', 'is', 'the', 'best', 'Computer', 'Science', 'Portal']
Python provides different methods to extract words from a string. Let’s explore them one by one.
Using split()
The split() method splits the string at spaces (or a specified delimiter) and returns a list of individual words. However, it does not remove punctuation marks, so they may stay attached to the words.
Example:
In this example, the split() method is used to extract individual words from a given string.
Python
Str = "Python is a powerful and versatile programming language"
print(Str.split())
Output
['Python', 'is', 'a', 'powerful', 'and', 'versatile', 'programming', 'language']
Explanation: split() is used to extract each word from Str; it separates the words based on spaces and returns them as a list.
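For instance, the minimal sketch below (using a made-up sample string) illustrates both points from above: punctuation stays attached when splitting on whitespace, and a custom delimiter can be passed to split().
Python
# Illustrative sketch: punctuation stays attached, and split() accepts a delimiter
Str = "Python, is simple, yet powerful"

print(Str.split())      # ['Python,', 'is', 'simple,', 'yet', 'powerful'] -- commas stay attached
print(Str.split(", "))  # ['Python', 'is simple', 'yet powerful'] -- splits only at ", "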
Using Regex
Regular expressions make it possible to extract words from text that contains punctuation or special characters. Python's re.findall() helps keep only the valid words.
Example:
This program uses the re.findall() method from Python's re module to extract words from a string.
Python
import re
Str = "Python, is widely-used @# for Data Science and AI.!!!"
T = re.findall(r'\w+', Str)
print(T)
Output
['Python', 'is', 'widely', 'used', 'for', 'Data', 'Science', 'and', 'AI']
Explanation: re.findall(r'\w+', Str) extracts all sequences of letters, digits and underscores, effectively skipping punctuation and special characters.
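A small variation worth noting (the sample string below is made up): since \w+ also matches digits and underscores, a pattern such as [A-Za-z]+ can be used instead when only purely alphabetic words are wanted.
Python
import re

# Illustrative comparison: \w+ vs. an alphabetic-only pattern
Str = "Python3 is_widely used for AI in 2024"
print(re.findall(r'\w+', Str))        # ['Python3', 'is_widely', 'used', 'for', 'AI', 'in', '2024']
print(re.findall(r'[A-Za-z]+', Str))  # ['Python', 'is', 'widely', 'used', 'for', 'AI', 'in']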
Using List Comprehension
A list comprehension, combined with functions like strip() and isalnum(), filters out punctuation and collects clean, valid words in a compact way.
Example:
Here, a list comprehension is used along with string.punctuation and the isalnum() method to extract clean words from a string.
Python
import string
Str = "Python, is simple @# yet powerful Programming Language.!!!"
T = [w.strip(string.punctuation) for w in Str.split() if w.strip(string.punctuation).isalnum()]
print(T)
Output
['Python', 'is', 'simple', 'yet', 'powerful', 'Programming', 'Language']
Explanation:
- Str.split() splits the string into words.
- w.strip(string.punctuation) strips leading and trailing punctuation from each word.
- isalnum() ensures only alphanumeric words are kept.
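A quick check (with a token chosen just for illustration) makes the last bullet concrete: a pure-punctuation token such as "@#" strips down to an empty string, and ''.isalnum() is False, so the comprehension drops it.
Python
import string

# Illustrative check of why pure-punctuation tokens disappear
token = "@#"
print(token.strip(string.punctuation))            # '' (empty string)
print(token.strip(string.punctuation).isalnum())  # False, so the comprehension drops it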
Using Regex + string.punctuation
Regular expressions can be combined with Python's string.punctuation to remove all punctuation marks from a string before extracting words. This is useful when text contains various special characters, ensuring cleaner word extraction.
Example:
This code extracts words from a string by removing all punctuation using regular expressions.
Python
import re
import string
Str = "Python, is simple @# yet powerful Programming Language.!!!"
a = "[" + re.escape(string.punctuation) + "]"
T = re.sub(a, "", Str).split()
print(T)
Output
['Python', 'is', 'simple', 'yet', 'powerful', 'Programming', 'Language']
Explanation:
- re.escape(string.punctuation) safely escapes all punctuation characters, and a = "[" + ... + "]" builds a regex character class that matches them.
- re.sub(a, "", Str) removes all punctuation from the string.
- .split() splits the cleaned string into individual words.
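One behavior worth noting (the sample string below is made up): because every punctuation character is deleted, hyphens and apostrophes inside words are removed as well, which differs from the tokenizer-based approach in the next section.
Python
import re
import string

# Illustrative sketch: punctuation inside words is also stripped out
a = "[" + re.escape(string.punctuation) + "]"
Str = "Python is easy-to-learn, isn't it?"
print(re.sub(a, "", Str).split())   # ['Python', 'is', 'easytolearn', 'isnt', 'it']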
Using NLP Libraries
Natural Language Processing (NLP) libraries like NLTK provide powerful tools for text analysis. When extracting words from a string, they offer more accuracy by properly handling punctuation, contractions and tokenization, making them ideal for complex or real-world text data.
Example:
This program demonstrates how to extract words from a string using NLTK's word_tokenize() function.
Python
import nltk

# nltk.download('punkt')  # the tokenizer models may need to be downloaded once
Str = "Python is easy-to-learn, powerful and widely used in tech!"
words = nltk.word_tokenize(Str)
print(words)
Output
['Python', 'is', 'easy-to-learn', ',', 'powerful', 'and', 'widely', 'used', 'in', 'tech', '!']
Explanation:
- nltk.word_tokenize() splits the string into tokens (words and punctuation).
- It preserves punctuation as separate tokens: ',' and '!'.
- It treats hyphenated words like "easy-to-learn" as one token.
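As a small follow-up sketch (not part of the original example), the punctuation tokens produced by word_tokenize() can be filtered out afterwards if only the words are wanted; the filter condition below is just one possible choice.
Python
import nltk

Str = "Python is easy-to-learn, powerful and widely used in tech!"
tokens = nltk.word_tokenize(Str)
words = [t for t in tokens if any(ch.isalnum() for ch in t)]  # keep tokens containing letters or digits
print(words)  # ['Python', 'is', 'easy-to-learn', 'powerful', 'and', 'widely', 'used', 'in', 'tech']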