Text Analysis (NLP Basics)

Machine Learning Fundamentals with Python

2 min read

Published Nov 16 2025



Tags: Clustering, Images, K-Means, Linear Regression, Logistic Regression, Machine Learning, Neural Networks, NLP, NumPy, Python, Random Forests, scikit-learn, Supervised Learning, Unsupervised Learning

Text Analysis (also known as Natural Language Processing, or NLP) is a branch of ML that focuses on understanding, processing, and extracting insights from human language.


Applications include:

  • Sentiment analysis (positive/negative reviews)
  • Spam detection
  • Chatbots and language translation
  • Topic modelling and keyword extraction

Text data is unstructured, so before we can use it in ML models, we must clean, tokenise, and vectorise it (convert words into numbers).




Loading and Inspecting Text Data

We’ll use a small sample dataset of product reviews to illustrate.

import pandas as pd

# Sample text dataset
data = {
    'review': [
        "This phone is amazing, battery lasts forever!",
        "Terrible service and the phone stopped working.",
        "Love the camera quality, very satisfied.",
        "Horrible experience, would not recommend.",
        "Decent phone for the price, could be better."
    ],
    'sentiment': ['positive', 'negative', 'positive', 'negative', 'neutral']
}

df = pd.DataFrame(data)
print(df)

Explanation:

  • Each row is a text review.
  • The sentiment column contains the label we want to predict (positive/negative/neutral).





Text Preprocessing

Before text can be analysed, we clean and prepare it.


Typical steps include:

  • Lowercasing
  • Removing punctuation and stopwords (common words like “the”, “is”)
  • Tokenisation (splitting text into words)
  • Lemmatisation or stemming (reducing words to their base form, e.g. "running", "ran", "runs" → "run"); see the sketch below
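
The cleaner used below handles lowercasing and punctuation. For tokenisation, stopword removal, and lemmatisation, a minimal sketch using NLTK (an extra dependency, not used elsewhere in this guide) could look like this:

import nltk

# One-off downloads: tokeniser model, stopword list, WordNet data
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

tokens = word_tokenize("The cameras are running smoothly")  # tokenisation
tokens = [t.lower() for t in tokens if t.isalpha()]         # lowercase, drop punctuation
tokens = [t for t in tokens if t not in stop_words]         # remove stopwords
print(tokens)  # ['cameras', 'running', 'smoothly']

# Lemmatisation defaults to nouns, so verbs need pos='v'
print(lemmatizer.lemmatize('cameras'))           # camera
print(lemmatizer.lemmatize('running', pos='v'))  # run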

Let’s clean the text using the clean-text library (installed with pip install clean-text):

from cleantext import clean

# Use clean-text to tidy each review
df["clean_review"] = df["review"].apply(
    lambda x: clean(
        x,
        lower=True,
        no_punct=True,
        no_numbers=True,
        no_urls=True,
        no_emails=True,
        no_line_breaks=True,
        no_emoji=True,
        no_digits=True
    )
)
print(df[['review', 'clean_review']])

Output:

                                             review                                     clean_review
0     This phone is amazing, battery lasts forever!     this phone is amazing battery lasts forever
1  Terrible service and the phone stopped working.  terrible service and the phone stopped working
2          Love the camera quality, very satisfied.          love the camera quality very satisfied
3         Horrible experience, would not recommend.         horrible experience would not recommend
4      Decent phone for the price, could be better.      decent phone for the price could be better





Vectorising Text (Turning Words into Numbers)

ML models need numbers, not words.
We can use Bag of Words (BoW) or TF-IDF (Term Frequency–Inverse Document Frequency) to represent text numerically.


Example: Using Scikit-learn’s CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['clean_review'])

print("Feature names (vocabulary):")
print(vectorizer.get_feature_names_out())

print("\nSparse matrix shape:", X.shape)

Output:

Feature names (vocabulary):
['amazing' 'and' 'battery' 'be' 'better' 'camera' 'could' 'decent'
 'experience' 'for' 'forever' 'horrible' 'is' 'lasts' 'love' 'not' 'phone'
 'price' 'quality' 'recommend' 'satisfied' 'service' 'stopped' 'terrible'
 'the' 'this' 'very' 'working' 'would']

Sparse matrix shape: (5, 29)

Explanation:

  • Each unique word becomes a column.
  • Each row represents one document (review).
  • The matrix values count how often each word appears (inspected directly below).
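
To inspect those counts directly, one option (fine at this tiny scale, though densifying a large sparse matrix would be wasteful) is to expand the matrix into a DataFrame:

# Expand the sparse count matrix so individual word counts are visible
counts = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(counts[['phone', 'amazing', 'terrible']])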

Example: Using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['clean_review'])

print("TF-IDF shape:", X_tfidf.shape)

TF-IDF gives more importance to words that are unique to a document (and less to common words like “phone”).
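
To see this in our data, inspect the weights for the first review (again densified only because the corpus is tiny):

# TF-IDF weights for the first review, strongest first
weights = pd.DataFrame(X_tfidf.toarray(), columns=tfidf.get_feature_names_out())
print(weights.loc[0].sort_values(ascending=False).head())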






Visualising Text with Word Clouds

Word clouds are a way to visualise the most frequent words in a corpus.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all text into one string
text = " ".join(df['clean_review'])

# Generate and display word cloud
wordcloud = WordCloud(width=600, height=300, background_color='white').generate(text)

plt.figure(figsize=(8,4))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud of Reviews")
plt.show()

[Figure: word cloud generated from the cleaned reviews]

Explanation:

  • The more frequent a word, the larger it appears in the cloud.
  • Great for quick qualitative insights; for exact numbers, see the snippet below.
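
If you need the exact frequencies behind the picture, a plain Counter over the same combined string gives them:

from collections import Counter

# Exact word frequencies behind the cloud
print(Counter(text.split()).most_common(5))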





Sentiment Analysis Example

Let’s classify text sentiment (positive or negative) using a simple model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Simplify dataset: only positive and negative
df_binary = df[df['sentiment'] != 'neutral']

# Vectorise text
X = tfidf.fit_transform(df_binary['clean_review'])
y = df_binary['sentiment']

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

print("Predictions:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Explanation:

  • We used a Logistic Regression model for classification.
  • The TF-IDF features serve as inputs.
  • We evaluate performance on unseen test data.
  • With only four labelled reviews, the scores are illustrative; a real sentiment model needs far more data (see the sketch below).
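
As a quick sanity check, the fitted vectoriser and model can score a brand-new review. The review text here is made up, and with so little training data the predicted label is illustrative only:

# Classify an unseen review; transform (not fit_transform) reuses the
# vocabulary learned during training
new_review = ["battery died after one week, very disappointed"]
print(model.predict(tfidf.transform(new_review)))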





Common NLP Concepts

  • Tokenisation - Splitting text into words or phrases
  • Stopwords - Common words to remove (e.g., “is”, “the”)
  • Lemmatisation - Reducing words to base form (“running” → “run”)
  • Bag of Words - Representing text by word frequency
  • TF-IDF - Weighted version of Bag of Words
  • Word Embeddings - Dense vector representations (Word2Vec, GloVe, BERT); a tiny sketch follows below
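
Word embeddings are beyond the scope of this guide, but as a taste, here is a minimal sketch that trains Word2Vec on our five cleaned reviews with gensim (an extra dependency, and far too little data for meaningful vectors):

from gensim.models import Word2Vec

# Each cleaned review becomes a list of tokens
sentences = [review.split() for review in df['clean_review']]

# 50-dimensional vectors; min_count=1 keeps every word despite the tiny corpus
w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=42)

print(w2v.wv['phone'][:5])           # first 5 dimensions of the 'phone' vector
print(w2v.wv.most_similar('phone'))  # nearest neighbours (noisy with 5 reviews)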
