Text Analysis (NLP Basics)
Machine Learning Fundamentals with Python
Published Nov 16 2025
Text Analysis (or NLP — Natural Language Processing) is a branch of ML that focuses on understanding, processing, and extracting insights from human language.
Applications include:
- Sentiment analysis (positive/negative reviews)
- Spam detection
- Chatbots and language translation
- Topic modelling and keyword extraction
Text data is unstructured, so before we can use it in ML models, we must clean, tokenise, and vectorise it (convert words into numbers).
Loading and Inspecting Text Data
We’ll use a small sample dataset of product reviews to illustrate.
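The exact dataset isn't important; a tiny hand-made DataFrame along these lines works (the reviews and labels below are just an illustrative assumption):

```python
import pandas as pd

# Hypothetical sample reviews -- any small labelled text dataset would do
reviews = pd.DataFrame({
    "text": [
        "The battery life is amazing, I love this phone",
        "Terrible screen, it broke after one week",
        "Decent value for the price, nothing special",
        "Absolutely fantastic camera and great build quality",
        "Worst purchase ever, the phone keeps freezing",
    ],
    "sentiment": ["positive", "negative", "neutral", "positive", "negative"],
})

print(reviews.head())                        # first few rows
print(reviews.shape)                         # (rows, columns)
print(reviews["sentiment"].value_counts())   # label distribution
```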
Explanation:
- Each row is a text review.
- The sentiment column contains the label we want to predict (positive/negative/neutral).
Text Preprocessing
Before text can be analysed, we clean and prepare it.
Typical steps include:
- Lowercasing
- Removing punctuation and stopwords (common words like “the”, “is”)
- Tokenisation (splitting text into words)
- Lemmatisation or stemming (reducing words to their base form, e.g. "running", "ran", "runs" → "run")
Let’s clean the text using Cleantext:
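A rough sketch of the cleaning step, assuming the clean-text package (`pip install clean-text`) for lowercasing and punctuation removal, plus NLTK's stopword list; the exact arguments used in your own pipeline may differ:

```python
from cleantext import clean          # pip install clean-text
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase and strip punctuation (handled by clean-text)
    cleaned = clean(text, lower=True, no_punct=True)
    # Tokenise on whitespace and drop stopwords
    tokens = [word for word in cleaned.split() if word not in stop_words]
    return " ".join(tokens)

# Assumes the reviews DataFrame from the previous section
reviews["clean_text"] = reviews["text"].apply(preprocess)
print(reviews[["text", "clean_text"]])
```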
After cleaning, each review is lowercased with punctuation and stopwords removed, ready to be vectorised.
Vectorising Text (Turning Words into Numbers)
ML models need numbers, not words.
We can use Bag of Words (BoW) or TF-IDF (Term Frequency–Inverse Document Frequency) to represent text numerically.
Example: Using Scikit-learn’s CountVectorizer
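A minimal sketch, assuming the clean_text column built above:

```python
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(reviews["clean_text"])

# Each column is a unique word, each row is one review
print(bow_vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
```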
Explanation:
- Each unique word becomes a column.
- Each row represents one document (review).
- The matrix values count how often each word appears.
Example: Using TfidfVectorizer
TF-IDF gives more importance to words that are unique to a document (and less to common words like “phone”).
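The usage mirrors CountVectorizer; a short sketch, again assuming the clean_text column from earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews["clean_text"])

# Same shape as the BoW matrix, but the values are TF-IDF weights rather than raw counts
print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```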
Visualising Text with Word Clouds
Word clouds are a way to visualise the most frequent words in a corpus.
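A simple sketch using the wordcloud package (`pip install wordcloud`) and matplotlib, assuming the cleaned reviews from earlier:

```python
from wordcloud import WordCloud      # pip install wordcloud
import matplotlib.pyplot as plt

# Join every cleaned review into one corpus string
corpus = " ".join(reviews["clean_text"])

wc = WordCloud(width=800, height=400, background_color="white").generate(corpus)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```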

Explanation:
- The more frequent a word, the larger it appears in the cloud.
- Great for quick qualitative insights.
Sentiment Analysis Example
Let’s classify text sentiment (positive or negative) using a simple model.
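A minimal sketch using TF-IDF features and Logistic Regression. It assumes the reviews DataFrame from earlier and drops the neutral class to keep the task binary; with such a tiny dataset the metrics are only illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Keep only positive/negative reviews for a binary task (assumption for this sketch)
binary = reviews[reviews["sentiment"] != "neutral"]

X_train, X_test, y_train, y_test = train_test_split(
    binary["clean_text"], binary["sentiment"], test_size=0.25, random_state=42
)

# Fit the vectoriser on the training text only, then transform the test text
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Evaluate on the held-out reviews
y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred, zero_division=0))
```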
Explanation:
- We used a Logistic Regression model for classification.
- The TF-IDF features serve as inputs.
- We evaluate performance on unseen test data.
- A much larger dataset would be needed for the model to be genuinely useful.
Common NLP Concepts
- Tokenisation: splitting text into words or phrases
- Stopwords: common words to remove (e.g., "is", "the")
- Lemmatisation: reducing words to their base form ("running" → "run")
- Bag of Words: representing text by word frequency
- TF-IDF: a weighted version of Bag of Words
- Word Embeddings: dense vector representations (Word2Vec, GloVe, BERT)