# Complete NLP Tutorial: Introduction to NLP & Text Preprocessing

This tutorial covers Natural Language Processing (NLP) fundamentals and text preprocessing techniques with Python code examples.
## 1. What is NLP?

Natural Language Processing (NLP) is a branch of AI that enables computers to understand, interpret, and generate human language.

### Key Applications

| Application | Description | Example |
|---|---|---|
| Chatbots | AI conversational agents | ChatGPT, Google Bard |
| Machine Translation | Text translation between languages | Google Translate |
| Sentiment Analysis | Detecting emotions in text | Twitter sentiment analysis |
| Named Entity Recognition (NER) | Identifying names, places, dates | Extracting "Apple" as a company |
| Text Summarization | Condensing long documents | News article summarization |
## 2. Text Preprocessing

Raw text must be cleaned and normalized before it can be used in downstream NLP tasks.

### Key Steps in Text Preprocessing

- Tokenization
- Stemming & Lemmatization
- Stopword Removal
- Regex Cleaning
## 3. Tokenization

Tokenization splits text into words, sentences, or subwords.

### Methods

| Method | Library | Use Case |
|---|---|---|
| Word Tokenization | `nltk.word_tokenize()` | Splitting sentences into words |
| Sentence Tokenization | `nltk.sent_tokenize()` | Splitting paragraphs into sentences |
| Subword Tokenization | Hugging Face Tokenizers | Handling rare words (e.g., "unhappiness" → "un", "happiness") |
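The table's subword example can be reproduced with the Hugging Face `transformers` package. A minimal sketch, assuming `transformers` is installed and the `bert-base-uncased` checkpoint can be downloaded; the exact pieces depend on the model's learned vocabulary:

```python
from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or unseen words are split into known subword pieces
print(tokenizer.tokenize("unhappiness"))
# The split depends on the vocabulary, e.g. ['un', '##happiness'] or finer pieces
```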
### Example: Word & Sentence Tokenization

```python
import nltk

nltk.download('punkt')  # tokenizer models; newer NLTK releases may also need 'punkt_tab'

text = "NLP is amazing! It helps computers understand language."

# Word tokenization
words = nltk.word_tokenize(text)
print("Word Tokens:", words)
# Output: ['NLP', 'is', 'amazing', '!', 'It', 'helps', 'computers', 'understand', 'language', '.']

# Sentence tokenization
sentences = nltk.sent_tokenize(text)
print("Sentence Tokens:", sentences)
# Output: ['NLP is amazing!', 'It helps computers understand language.']
```
## 4. Stemming vs. Lemmatization

Both reduce words to their base form, but lemmatization is dictionary-based and therefore more accurate.

| Method | Example (Input → Output) | Library |
|---|---|---|
| Stemming | "running" → "run" | `PorterStemmer`, `SnowballStemmer` |
| Lemmatization | "better" → "good" | `WordNetLemmatizer` (requires POS tag) |
### Example: Stemming & Lemmatization

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # lexical database used by the lemmatizer

text = "running runs ran better"

# Stemming: fast rule-based suffix stripping
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in text.split()]
print("Stemmed:", stemmed)
# Output: ['run', 'run', 'ran', 'better']

# Lemmatization: dictionary lookup, guided by the POS tag
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in text.split()]  # 'v' = verb
print("Lemmatized:", lemmatized)
# Output: ['run', 'run', 'run', 'better']
```
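Note that with `pos='v'` the adjective "better" is left untouched; the "better" → "good" mapping from the table only applies with the adjective tag. A quick check, reusing the `lemmatizer` from above:

```python
# 'a' marks an adjective; WordNet maps "better" to its lemma "good"
print(lemmatizer.lemmatize("better", pos='a'))
# Output: good
```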
## 5. Stopword Removal

Stopwords (e.g., "the", "is", "and") carry little meaning on their own and are often removed to reduce noise.

### Example: Removing Stopwords

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

text = "This is an example sentence showing off stopword filtration."
tokens = word_tokenize(text.lower())

stop_words = set(stopwords.words('english'))
# Keep alphabetic tokens that are not stopwords
filtered = [word for word in tokens if word not in stop_words and word.isalpha()]
print("Filtered:", filtered)
# Output: ['example', 'sentence', 'showing', 'stopword', 'filtration']
```
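One caveat worth a sketch (my addition, not part of the original steps): the default NLTK list includes negations such as "not", which carry real signal for sentiment tasks, so it is common to whitelist them:

```python
# Keep negations so that "not good" stays distinguishable from "good"
custom_stop_words = stop_words - {"not", "no", "nor"}
filtered = [word for word in tokens if word not in custom_stop_words and word.isalpha()]
```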
## 6. Regex Cleaning

Regular expressions remove unwanted patterns such as URLs, email addresses, and punctuation.

### Example: Cleaning Text with Regex

```python
import re

text = "Check out https://example.com! Email me at user@email.com."

# Remove URLs ('http\S+' also covers 'https'; 'www\S+' catches bare domains)
cleaned = re.sub(r'http\S+|www\S+', '', text)

# Remove email addresses
cleaned = re.sub(r'\S+@\S+', '', cleaned)

# Remove punctuation (anything that is not a word character or whitespace)
cleaned = re.sub(r'[^\w\s]', '', cleaned)

print("Cleaned Text:", cleaned)
# Output: "Check out  Email me at " (note the leftover whitespace)
```
## 7. Full Text Preprocessing Pipeline

Combining all of the steps above into a single function:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs and emails
    text = re.sub(r'http\S+|www\S+|\S+@\S+', '', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize (default POS is noun)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(tokens)

text = "NLP is awesome! Check https://nlp.org for more info."
print("Processed:", preprocess_text(text))
# Output: "nlp awesome check info"
```
## 8. Libraries Comparison

| Task | NLTK | spaCy | TextBlob |
|---|---|---|---|
| Tokenization | ✅ | ✅ (faster) | ✅ |
| Lemmatization | ✅ (needs POS) | ✅ (automatic POS) | ✅ |
| Stopwords | ✅ | ✅ | ✅ |
| Sentiment Analysis | ✅ (VADER) | ❌ | ✅ |
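Since the table credits TextBlob with built-in sentiment analysis, here is a minimal sketch (assuming the `textblob` package is installed):

```python
from textblob import TextBlob

blob = TextBlob("NLP is amazing!")
# .sentiment returns a namedtuple: polarity in [-1, 1], subjectivity in [0, 1]
print(blob.sentiment)
```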
### Example: spaCy for Faster Processing

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)

# Extract tokens, lemmas, and POS tags
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```
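The same `doc` object also exposes named entities, tying back to the NER application from section 1:

```python
# Named entities detected in the sentence above
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output: Apple ORG / U.K. GPE / $1 billion MONEY
```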
## Summary

✅ NLP enables machines to work with human language.
✅ Text preprocessing includes tokenization, stemming/lemmatization, stopword removal, and regex cleaning.
✅ NLTK is great for learning, spaCy for production pipelines, and TextBlob for quick sentiment analysis.
**Next Steps:**

➡️ Try these techniques on real datasets (e.g., Twitter data).
➡️ Explore feature extraction (TF-IDF, Word2Vec); a small TF-IDF teaser follows below.
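As a teaser for that next step, here is a minimal TF-IDF sketch (an assumption on my part: scikit-learn is installed, which this tutorial does not otherwise cover):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "nlp helps computers understand language",
    "nlp is amazing",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))                # TF-IDF weight of each term per document
```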