Text Analysis (NLP Basics)

Machine Learning Fundamentals with Python

2 min read

Published Nov 16 2025



Tags: Clustering, Images, K-Means, Linear Regression, Logistic Regression, Machine Learning, Neural Networks, NLP, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

Text Analysis (also known as Natural Language Processing, or NLP) is a branch of ML that focuses on understanding, processing, and extracting insights from human language.


Applications include:

  • Sentiment analysis (positive/negative reviews)
  • Spam detection
  • Chatbots and language translation
  • Topic modelling and keyword extraction

Text data is unstructured, so before we can use it in ML models, we must clean, tokenise, and vectorise it (convert words into numbers).




Loading and Inspecting Text Data

We’ll use a small sample dataset of product reviews to illustrate.

import pandas as pd

# Sample text dataset
data = {
    'review': [
        "This phone is amazing, battery lasts forever!",
        "Terrible service and the phone stopped working.",
        "Love the camera quality, very satisfied.",
        "Horrible experience, would not recommend.",
        "Decent phone for the price, could be better."
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative', 'neutral']
}

df = pd.DataFrame(data)
print(df)

Explanation:

  • Each row is a text review.
  • The sentiment column contains the label we want to predict (positive/negative/neutral).





Text Preprocessing

Before text can be analysed, we clean and prepare it.


Typical steps include:

  • Lowercasing
  • Removing punctuation and stopwords (common words like “the”, “is”)
  • Tokenisation (splitting text into words)
  • Lemmatisation or stemming (reducing words to their base form, e.g. "running", "ran", "runs" → "run"); see the sketch below
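
The cleaner used below handles lowercasing and punctuation. For tokenisation, stopword removal, and lemmatisation, a minimal sketch using NLTK (an extra dependency, not used elsewhere in this guide) could look like this:

import nltk

# One-off downloads: tokeniser model, stopword list, WordNet data
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

tokens = word_tokenize("The cameras are running smoothly")  # tokenisation
tokens = [t.lower() for t in tokens if t.isalpha()]         # lowercase, drop punctuation
tokens = [t for t in tokens if t not in stop_words]         # remove stopwords
print(tokens)  # ['cameras', 'running', 'smoothly']

# Lemmatisation defaults to nouns, so verbs need pos='v'
print(lemmatizer.lemmatize('cameras'))           # camera
print(lemmatizer.lemmatize('running', pos='v'))  # run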

Let’s clean the text using the clean-text library (installed with pip install clean-text):

from cleantext import clean

# Use clean-text to tidy each review
df["clean_review"] = df["review"].apply(
    lambda x: clean(
        x,
        lower=True,
        no_punct=True,
        no_numbers=True,
        no_urls=True,
        no_emails=True,
        no_line_breaks=True,
        no_emoji=True,
        no_digits=True
    )
)
print(df[['review', 'clean_review']])

Output:

                                             review                                     clean_review
0     This phone is amazing, battery lasts forever!     this phone is amazing battery lasts forever
1  Terrible service and the phone stopped working.  terrible service and the phone stopped working
2          Love the camera quality, very satisfied.          love the camera quality very satisfied
3         Horrible experience, would not recommend.         horrible experience would not recommend
4      Decent phone for the price, could be better.      decent phone for the price could be better





Vectorising Text (Turning Words into Numbers)

ML models need numbers, not words.
We can use Bag of Words (BoW) or TF-IDF (Term Frequency–Inverse Document Frequency) to represent text numerically.


Example: Using Scikit-learn’s CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean_review'])

print("Feature names (vocabulary):")
print(vectorizer.get_feature_names_out())

print("\nSparse matrix shape:", X.shape)

Output:

Feature names (vocabulary):
['amazing' 'and' 'battery' 'be' 'better' 'camera' 'could' 'decent'
 'experience' 'for' 'forever' 'horrible' 'is' 'lasts' 'love' 'not' 'phone'
 'price' 'quality' 'recommend' 'satisfied' 'service' 'stopped' 'terrible'
 'the' 'this' 'very' 'working' 'would']

Sparse matrix shape: (5, 29)

Explanation:

  • Each unique word becomes a column.
  • Each row represents one document (review).
  • The matrix values count how often each word appears (inspected directly below).
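
To inspect those counts directly, one option (fine at this tiny scale, though densifying a large sparse matrix would be wasteful) is to expand the matrix into a DataFrame:

# Expand the sparse count matrix so individual word counts are visible
counts = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(counts[['phone', 'amazing', 'terrible']])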

Example: Using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['clean_review'])

print("TF-IDF shape:", X_tfidf.shape)

TF-IDF gives more importance to words that are unique to a document (and less to common words like “phone”).
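
To see this in our data, inspect the weights for the first review (again densified only because the corpus is tiny):

# TF-IDF weights for the first review, strongest first
weights = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
print(weights.loc[0].sort_values(ascending=False).head())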






Visualising Text with Word Clouds

Word clouds are a way to visualise the most frequent words in a corpus.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all text into one string
text = " ".join(df['clean_review'])

# Generate and display word cloud
wordcloud = WordCloud(width=600, height=300, background_color='white').generate(text)

plt.figure(figsize=(8,4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud of Reviews")
plt.show()

[Figure: word cloud generated from the cleaned reviews]

Explanation:

  • The more frequent a word, the larger it appears in the cloud.
  • Great for quick qualitative insights; for exact numbers, see the snippet below.
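
If you need the exact frequencies behind the picture, a plain Counter over the same combined string gives them:

from collections import Counter

# Exact word frequencies behind the cloud
print(Counter(text.split()).most_common(5))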





Sentiment Analysis Example

Let’s classify text sentiment (positive or negative) using a simple model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Simplify dataset: only positive and negative
df_binary = df[df['sentiment'] != 'neutral']

# Vectorise text
X = tfidf.fit_transform(df_binary['clean_review'])
y = df_binary['sentiment']

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

print("Predictions:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Explanation:

  • We used a Logistic Regression model for classification.
  • The TF-IDF features serve as inputs.
  • We evaluate performance on unseen test data.
  • With only four labelled reviews, the scores are illustrative; a real sentiment model needs far more data (see the sketch below).
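
As a quick sanity check, the fitted vectoriser and model can score a brand-new review. The review text here is made up, and with so little training data the predicted label is illustrative only:

# Classify an unseen review; transform (not fit_transform) reuses the
# vocabulary learned during training
new_review = ["battery died after one week, very disappointed"]
print(model.predict(tfidf.transform(new_review)))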





Common NLP Concepts

  • Tokenisation - Splitting text into words or phrases
  • Stopwords - Common words to remove (e.g., “is”, “the”)
  • Lemmatisation - Reducing words to base form (“running” → “run”)
  • Bag of Words - Representing text by word frequency
  • TF-IDF - Weighted version of Bag of Words
  • Word Embeddings - Dense vector representations (Word2Vec, GloVe, BERT); a tiny sketch follows below
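
Word embeddings are beyond the scope of this guide, but as a taste, here is a minimal sketch that trains Word2Vec on our five cleaned reviews with gensim (an extra dependency, and far too little data for meaningful vectors):

from gensim.models import Word2Vec

# Each cleaned review becomes a list of tokens
sentences = [review.split() for review in df['clean_review']]

# 50-dimensional vectors; min_count=1 keeps every word despite the tiny corpus
w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=42)

print(w2v.wv['phone'][:5])           # first 5 dimensions of the 'phone' vector
print(w2v.wv.most_similar('phone'))  # nearest neighbours (noisy with 5 reviews)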
