Part 4 – The Language of Machines: Understanding NLP in the Modern World (An In-Depth NLP Series)
Part 4: Vectorization and Embedding Techniques
Back again with more NLP goodness. So far, we have covered the two basic levels of Natural Language Processing, Natural Language Understanding and Natural Language Generation, along with the essential data preprocessing steps: tokenization, normalization, stop word removal, emoji and URL filtering, and more advanced procedures such as stemming and text augmentation. We also looked at text representation techniques such as Term Frequency, Document Frequency, Treebank Annotation, Dependency Parsing, and simple Word Clouds, all of which analyze text and give it structure. And we discussed the fascinating Zipf’s Law, which mathematically describes the regularity in how often words are repeated in language.
In Case You Missed It: Previous Parts of Our NLP Journey
- Part 1: The Building Blocks of NLP – Exploring NLU, NLG, and Core Elements
- Part 2: From Raw to Refined – Data Processing
- Part 3: Data Visualization and Text Representation in NLP
Now, let’s take the next step! In this part of the series, we will cover techniques for transforming raw text into numeric formats using various vectorization methods, including Bag-of-Words, n-grams, One-hot Encoding, TF-IDF, Count Vectorization, and Byte-Pair Encoding (BPE). Next, we move on to word embeddings: from the classical Skip-Gram and CBOW models and pre-trained static vectors like Word2Vec and GloVe to contextual embeddings like BERT and ELMo, where a word’s vector changes with its context.
To visualize and interpret this linguistic intelligence, we are going to employ tools like t-SNE (t-distributed Stochastic Neighbor Embedding) and TextEvaluator, making our models not only smarter but also interpretable.
So, buckle up! This part is about math and meaning: machines are starting to understand language in ways they never could before. Let’s dive right in.
Vectorization
Do you ever stop to wonder how chatbots manage to decode human language or how Google ranks search results by relevance? The magic happens through vectorization: the method by which text gets converted into numerical data that machines can process. Learn vectorization, and you can achieve far better sentiment analysis, smarter AI assistants, and enhanced search outcomes. Come, let us dive in!
Vectorization is the conversion of textual data to numeric form so that machines can analyze it. Since computers do not “understand” words even remotely, Natural Language Processing (NLP) is replete with techniques that transform words, sentences, and documents into numerical vectors or arrays.
Why Is Vectorization Important?
- Allows text to be processed by machine learning models.
- Powers text classification, sentiment analysis, chatbots, and search engines.
- Used in recommendation systems, fraud detection, and many AI assistants.
Common vectorization techniques include BoW, N-Grams, One-Hot Encoding, TF-IDF, Count Vectorization, and BPE, all of which help machines process language efficiently. Let us dig into them one by one!
Bag-of-Words (BoW)
Bag-of-Words (BoW) is the simplest NLP vectorization technique of all: it represents text as word counts. It treats a text as an unordered collection (a “bag”) of words, ignoring their order but keeping track of how many times each one occurs.
So how does it work?
- Tokenization: Split the text into individual words.
- Vocabulary generation: Build the set of unique words across the whole corpus.
- Vector encoding: Represent each document as a vector of how many times each vocabulary word occurs in it.
Example:
Consider two example sentences:
1. “I love NLP.”
2. “NLP is amazing.”
| | I | love | NLP | is | amazing |
|---|---|---|---|---|---|
| S1 | 1 | 1 | 1 | 0 | 0 |
| S2 | 0 | 0 | 1 | 1 | 1 |
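The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) are made up for this example, the text is lowercased, and the second sentence is assumed to be “NLP is amazing.” so that it matches the table’s columns.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Collect the unique tokens of the corpus in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


documents = ["I love NLP.", "NLP is amazing."]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(vocabulary)  # ['i', 'love', 'nlp', 'is', 'amazing']
for vector in vectors:
    print(vector)  # [1, 1, 1, 0, 0] then [0, 0, 1, 1, 1]
```

Note that word order plays no role here: “NLP love I” would produce exactly the same vector as “I love NLP.”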
N-Gram
An N-gram is a sequence of N consecutive words or characters in a text. N-grams help capture context and phrase structure, which single-word models (such as BoW) fail to do.
Example:
Sentence: I love NLP techniques
N-Grams:
Unigrams (1-grams): “I”, “love”, “NLP”, “techniques”
Bigrams (2-grams): “I love”, “love NLP”, “NLP techniques”
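A sliding window over the token list is enough to produce these N-grams. The sketch below assumes simple whitespace tokenization, and the function name `ngrams` is chosen just for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love NLP techniques".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # [('I', 'love'), ('love', 'NLP'), ('NLP', 'techniques')]
print(trigrams)  # [('I', 'love', 'NLP'), ('love', 'NLP', 'techniques')]
```

The same function works for character N-grams if you pass `list("love")` instead of a list of words.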
Why Are N-Grams Useful?
- Captures local context: enables tasks like text prediction (e.g., autocomplete).
- More relevant search results: can be used for retrieval and keyword matching.
- Better sentiment insight: bigram and trigram models capture phrase-level sentiment (for example, “not good”) that single words miss.
One-hot Encoding
One-hot encoding is a vectorization method that turns words into binary vectors. Each word in the vocabulary gets a unique vector that is all zeros except for a single 1 at the position assigned to that word.
Where is One-Hot Encoding used?
- Text classification: for basic NLP tasks.
- Speech recognition: in the initial stages of feature extraction.
- Basic neural networks: as the input representation.
For larger vocabularies, word embeddings (Word2Vec, GloVe, or BERT) are normally used instead because they are more efficient.
Example
Suppose we have a three-word vocabulary: “cat”, “dog”, “fish”.
We map every word to a different binary vector in which exactly one position is 1 (active) and all the others are 0.
One-hot vectors:
“cat” → [1, 0, 0] because it is the first word.
“dog” → [0, 1, 0] because it is the second word.
“fish” → [0, 0, 1] because it is the third word.
This gives every word a unique representation, but the vectors carry no meaning, context, or relationships between words.
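Here is a small sketch of the same mapping in Python. The helper names `one_hot` and `dot` are illustrative. The dot product at the end shows why one-hot vectors carry no notion of similarity: any two different words are equally unrelated.

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vocabulary = ["cat", "dog", "fish"]
encoded = {word: one_hot(word, vocabulary) for word in vocabulary}

print(encoded["cat"])   # [1, 0, 0]
print(encoded["dog"])   # [0, 1, 0]
print(encoded["fish"])  # [0, 0, 1]

# Different words never overlap, so "cat" is exactly as far from "dog" as from "fish".
print(dot(encoded["cat"], encoded["dog"]))   # 0
print(dot(encoded["cat"], encoded["fish"]))  # 0
```

Each vector is as long as the vocabulary, which is why this representation becomes impractical once the vocabulary grows to tens of thousands of words.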
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that measures how important a term is to a document relative to the rest of a corpus. As opposed to the raw word count, TF-IDF lowers the score for common words (e.g., “the” and “is”) and promotes uncommon yet significant terms.
Breaking Down TF-IDF
- Term Frequency (TF): how often a word appears in a document, relative to the document’s length.
$$\mathrm{TF}(t) = \frac{\text{number of times word } t \text{ appears in a document}}{\text{total number of words in the document}}$$
Example:
In a document of 100 words, if “NLP” appears 5 times, its TF is 5/100 = 0.05.
- Inverse Document Frequency (IDF): how distinctive a word is across the documents of the corpus.
$$\mathrm{IDF}(t) = \log{\frac{\text{total number of documents}}{\text{number of documents containing word } t}}$$
Example:
If the word “NLP” appears in 10 out of 1,000 documents, its IDF (using a base-10 logarithm) is log(1000/10) = log(100) = 2.
- TF-IDF score: the product of the two.
$$\text{TF-IDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t)$$
Example:
For a word with TF = 0.05 and IDF = 2, TF-IDF = 0.05 × 2 = 0.1.
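The arithmetic above can be checked with a short script. This sketch follows the formulas exactly as written in this section, with a base-10 logarithm; the function names are illustrative. Library implementations often use a different log base or add smoothing, so their numbers will generally differ from these.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: occurrences of the term divided by the document length."""
    return term_count / total_terms


def inverse_document_frequency(total_documents, documents_with_term):
    """IDF: base-10 log of (corpus size / number of documents containing the term)."""
    return math.log10(total_documents / documents_with_term)


# Numbers from the worked example: 5 occurrences in a 100-word document,
# and a term that appears in 10 of 1,000 documents.
tf = term_frequency(5, 100)
idf = inverse_document_frequency(1000, 10)
tf_idf = tf * idf

print(tf, idf, tf_idf)  # 0.05 2.0 0.1

# A word that appears in every document gets an IDF of 0, so its TF-IDF is 0 too.
common_word_idf = inverse_document_frequency(1000, 1000)
print(common_word_idf)  # 0.0
```

The last two lines show the mechanism behind the claim that TF-IDF lowers the score for general words: a term found in every document contributes nothing, however often it occurs.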
CountVec (Count Vectorization)
Count Vectorization (abbreviated as CountVec) is one of the simplest and most fundamental vectorization methods in Natural Language Processing (NLP). It is the matrix form of the Bag-of-Words idea: it converts text into a matrix of token counts that machine learning models can work with. Count vectorization is widely used in text classification, spam detection, sentiment analysis, and topic modeling.
Essentially, Count Vectorization builds a vocabulary of all the unique words in a given corpus (a collection of documents) and then counts the occurrences of each word in each document. The result is a sparse matrix in which:
- Rows correspond to individual documents.
- Columns correspond to words in the vocabulary.
- Cell values store how often each word occurs in that document.
Consider a very simple example with the following two sentences:
“I love NLP.”
“NLP is fun”
The vocabulary becomes: [“I”, “love”, “NLP”, “is”, “fun”]
The document vectors would be:
Document 1 → [1, 1, 1, 0, 0]
Document 2 → [0, 0, 1, 1, 1]
Stacked together, these vectors form the count matrix for the two documents.
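Because most documents use only a small part of the vocabulary, the count matrix is mostly zeros, which is why it is usually stored in sparse form. The sketch below illustrates that idea in plain Python by keeping only the non-zero cells in a dictionary keyed by (row, column). The helper names and the simple tokenizer (strip periods, split on spaces, keep case) are assumptions made for this example.

```python
from collections import Counter

documents = ["I love NLP.", "NLP is fun"]


def tokenize(text):
    """Very simple tokenizer: drop periods and split on whitespace (case is kept)."""
    return text.replace(".", "").split()


# Map each unique word to a column index, in order of first appearance.
vocabulary = {}
for doc in documents:
    for token in tokenize(doc):
        vocabulary.setdefault(token, len(vocabulary))

# Sparse storage: only non-zero counts are kept, keyed by (row, column).
sparse_counts = {}
for row, doc in enumerate(documents):
    for token, count in Counter(tokenize(doc)).items():
        sparse_counts[(row, vocabulary[token])] = count


def to_dense(sparse, n_rows, n_cols):
    """Expand the sparse dictionary into a full rows-by-columns matrix."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in sparse.items():
        dense[row][col] = value
    return dense


dense = to_dense(sparse_counts, len(documents), len(vocabulary))

print(list(vocabulary))  # ['I', 'love', 'NLP', 'is', 'fun']
print(dense)             # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

In practice you would normally reach for a library implementation such as scikit-learn’s `CountVectorizer`. Its default settings lowercase the text and skip single-character tokens such as “I”, so its vocabulary for these two sentences would differ slightly from the one shown here.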
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is one of the most effective subword tokenization methods in modern Natural Language Processing (NLP). It addresses an important challenge in language modeling: representing very rare or unknown words efficiently. Instead of treating every word as an indivisible token (which leads to huge vocabularies and many out-of-vocabulary issues), BPE breaks words into smaller, manageable subword tokens, allowing models to work with a limited vocabulary while still covering a wide range of words.
How does BPE work?
BPE starts by treating every character in a word as an independent symbol. It then merges the most frequent pair of adjacent symbols into a new symbol (hence the “pair” in the name). Repeating this over several iterations turns common letter combinations into subword units.
For example, using the word list: low, lowest, lower
1. Characters: l o w, l o w e s t, l o w e r
2. Find the most frequent pair: l o (3 occurrences, tied with o w) –> merge to lo, giving lo w, lo w e s t, lo w e r
3. Repeat: lo w –> low, then low e –> lowe, and so on.
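The merge loop can be sketched as follows. This is a simplified illustration of BPE training under a few assumptions: each word in the list counts once, there is no end-of-word marker, and ties between equally frequent pairs are broken by whichever pair was seen first. The function names are made up for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of the given adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Start from single characters; each word is assumed to occur once.
words = {tuple(word): 1 for word in ["low", "lowest", "lower"]}

merges = []
for _ in range(3):  # three merge steps are enough for this tiny example
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # ties go to the pair counted first
    merges.append(best)
    words = merge_pair(words, best)

print(merges)       # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(list(words))  # [('low',), ('lowe', 's', 't'), ('lowe', 'r')]
```

After three merges, “low” is a single token and the rarer words are built from the shared piece “lowe” plus leftover characters, which is how BPE keeps the vocabulary small while still covering unseen words.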
Traditional Word Embeddings
Unlike classic vectorization methods such as one-hot encoding or bag-of-words, which fail to capture meaning in any contextual or dimensionally efficient way, traditional embeddings like Word2Vec, GloVe, and FastText map words into a continuous vector space where similar words sit close together. These embeddings capture relationships like “king” - “man” + “woman” ≈ “queen,” adding a layer of linguistic understanding.
One-Hot Encoding and Bag-of-Words treat words as independent and have no knowledge of context or of the relations among words. Traditional word embeddings, by contrast, learn meaning and relations among words from the text corpora they are trained on. Let us delve into Skip-Gram, CBOW, and pre-trained word embeddings, foundation stones of NLP that learn meaningful vector representations of words from large text corpora. Put simply, they learn from the contexts in which words appear.
Skip-Gram
Skip-Gram is one of the best-known algorithms in Natural Language Processing (NLP) for learning word embeddings, the compact vector representations of words. It is part of the Word2Vec architecture originally proposed by Google. Given a target word in a sentence, the Skip-Gram model predicts the words in its surrounding context, known as context words. The main idea is that words used in similar contexts tend to have similar meanings, so the model learns embeddings that preserve semantic relationships.
To understand Skip-Gram better, let’s walk through an example. Take the sentence “Artificial Intelligence is Transforming Industries”. If we select “intelligence” as the center word and set the window size to 2, Skip-Gram will try to predict the words up to two positions before and after it: “Artificial”, “is”, and “Transforming” (only one word precedes the center word here). The model creates training pairs such as (intelligence → Artificial), (intelligence → is), and (intelligence → Transforming). This is done for every word in the text, allowing the model to progressively learn how words are used alongside each other. Over time, words that appear in closely related contexts, like “intelligence,” “machine,” and “learning,” end up with similar vectors and are grouped together in the vector space. This allows NLP systems to learn the relations between words, and Skip-Gram tends to produce good embeddings even for rare words.
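Generating the training pairs is the part of Skip-Gram that is easiest to show in code. The sketch below covers only that data-preparation step, not the neural network that learns from the pairs. The function name `skipgram_pairs` is illustrative, and the window counts up to two positions on each side of the center word, so for “intelligence” it covers one word to the left and two to the right.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within the window of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "artificial intelligence is transforming industries".split()
pairs = skipgram_pairs(tokens, window=2)

intelligence_pairs = [pair for pair in pairs if pair[0] == "intelligence"]
print(intelligence_pairs)
# [('intelligence', 'artificial'), ('intelligence', 'is'), ('intelligence', 'transforming')]
print(len(pairs))  # 14
```

Each pair becomes one training example: the center word is the input, and the model is rewarded for assigning high probability to the context word.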
CBOW (Continuous Bag of Words)
CBOW is a well-known neural network model for learning word embeddings. It was introduced as part of Word2Vec and predicts a word from its surrounding context. Given the words preceding and following a missing entry in a sentence, the CBOW model guesses the missing word.
For example, take the sentence “Artificial Intelligence is transforming industries.” With “intelligence” as the target word and a context window of 2, the context words are “Artificial,” “is,” and “transforming” (“industries” lies three positions away, outside the window). The CBOW model takes these context words as inputs and tries to predict the target word, “intelligence.” This is repeated over large text corpora so the model learns how words are used in context and builds meaningful vector representations of them.
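CBOW flips the direction of the training examples: the context words are the input and the middle word is the label. Here is a sketch of that data-preparation step only (the model itself is not shown), with an illustrative function name and a window of up to two positions on each side of the target.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target) examples, one per position in the text."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, target))
    return examples


tokens = "artificial intelligence is transforming industries".split()
examples = cbow_examples(tokens, window=2)

context, target = examples[1]
print(context)  # ['artificial', 'is', 'transforming']
print(target)   # intelligence
```

With this window, “industries” is three positions away from “intelligence” and therefore falls outside its context.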
Pre-trained word Embeddings (e.g., Word2Vec, GloVe)
Pre-trained word embeddings are ready-to-use word vector representations; they have been trained on huge text corpora and are widely used to improve the performance of NLP models. Instead of learning word embeddings from scratch for every new task, these pre-trained models (for example, Word2Vec, GloVe, and FastText) can be reused because they have already captured many semantic relationships between words.
For example, the analogy that “king” is related to “queen” in the same way that “man” is related to “woman” is captured by vector arithmetic in Google’s pre-trained Word2Vec model, which was trained on news content: king – man + woman ≈ queen. Similarly, GloVe (Global Vectors for Word Representation), developed at Stanford, produces dense, high-quality word vectors from global co-occurrence statistics of the entire corpus.
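To make the vector arithmetic concrete, here is a toy sketch. The three-dimensional vectors below are hand-made for illustration only (roughly “royalty”, “gender”, and “food” axes); they are not taken from Word2Vec or GloVe, whose vectors typically have hundreds of dimensions. The analogy is resolved by finding the word whose vector has the highest cosine similarity to king - man + woman.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical toy embeddings, invented for this example.
vectors = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# As is common practice, the three input words are excluded from the candidates.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda word: cosine(target, candidates[word]))

print(best)  # queen
```

With real pre-trained vectors the match is approximate rather than exact, which is why the relationship is written with ≈ instead of =.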
Contextual Word Embeddings
Contextual word embeddings are a more recent approach to word representation in which a word’s meaning is inferred from the sentence it appears in. Unlike traditional embeddings such as Word2Vec or GloVe, where each word has a single fixed vector, contextual models generate different vectors for the same word in different contexts. For instance, the word “bank” has one meaning in “river bank” and an entirely different one in “savings bank,” and models like BERT, ELMo, or GPT can capture such nuances.
These embeddings are generated by deep neural networks, frequently based on transformer architectures, that read the whole sentence or even an entire document to contextualize each word. The result is a more context-sensitive representation of language, which especially helps with ambiguity, sentiment, and syntax. Contextual embeddings are used throughout cutting-edge NLP applications such as question answering, translation, and conversational AI, and they are part of the reason for the success of models such as ChatGPT and Google’s BERT.
Sentence-Document Embedding
Sentence and document embeddings are single vectors that capture the meaning of a whole sentence or document rather than of individual words. These embeddings enable machines to interpret not just the words but the meaning the words convey together in the sentence or paragraph as a whole.
Let’s take the example of the following two sentences:
“Today the weather is sunny.”
“Today, it is a bright, clear day.”
Even though the words are different, the meaning conveyed is essentially the same. Sentence/document embedding models such as Sentence-BERT create vector representations of the two that lie close together in the embedding space, so the sentences can be recognized as similar in meaning. This benefits meaning-based search, duplicate detection, and document grouping.
Pre-trained word Embeddings (Contextual models like BERT, ELMo)
Pre-trained contextual models like BERT and ELMo produce embeddings that change with the word’s role in the sentence. Unlike Word2Vec, which assigns each word a single static vector, these models capture context and therefore usually represent meaning better.
Example:
Bass is used in the following contexts:
“He played the bass guitar at the concert.”
“They caught a bass in the lake.”
In static word embeddings, “bass” gets the same embedding in both sentences, which could mislead the model. With BERT or ELMo, however, each sentence yields a distinct embedding, one associated with music and the other with fish, because the model takes the surrounding words into account. These models make advanced NLP tasks such as question answering, text summarization, and language inference, where meaning shifts with context, much more tractable.
Data Visualization
As already stated, NLP data, whether text or numbers, can be turned into diagrams, charts, word clouds, or embedding plots. Because extracting information from raw data is challenging and labor-intensive, visualization helps researchers and developers uncover what is hidden in the data more easily. This is especially valuable in NLP, which relies on high-dimensional features such as word embeddings, sentence embeddings, and classification results. For example, plotting word clusters helps show how words with similar meanings sit close together in the embedding space. Such visualizations can be generated with specialized NLP dashboards, matplotlib, seaborn, Plotly, and many other tools. These representations help researchers and model interpreters turn complex, multi-dimensional data into something far simpler to read.
t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE is one of the most powerful approaches for visualizing high-dimensional data, like word or sentence embeddings, in NLP because it projects them into two- or three-dimensional space. It was developed by Laurens van der Maaten and Geoffrey Hinton. t-SNE attempts to place similar objects close together and dissimilar ones far apart, which makes it possible to visually identify clusters of, or semantic relationships between, words, phrases, or documents.
Example: When working with a Word2Vec model, for instance, one can inspect the relationships among the words king, woman, queen, and man. t-SNE often places these embeddings in clusters related to concepts such as femininity or royalty. t-SNE is excellent for visualizing high-dimensional data but is not well suited to producing features for training models because of its stochastic nature and heavy computational requirements. Nonetheless, it remains a first choice in many NLP projects for interpreting and debugging embedding spaces.
The image is a t-SNE (t-distributed Stochastic Neighbor Embedding) plot that visualizes high-dimensional data (e.g., word embeddings) in two dimensions. Here is what it shows:
X-axis and Y-axis: These axes are the dimensions produced by t-SNE. They have no individual meaning and serve only to spatially separate data points so that their relationships become visually interpretable.
Points: Each point is a data vector, for instance a word, sentence, or document from an embedding model.
Clusters: Data points that are close together are semantically similar. For instance, in a word embedding visualization, words like “cat”, “dog”, and “rabbit” may form a cluster, reflecting that they are related in meaning.
TextEvaluator
TextEvaluator is a contemporary evaluation tool that assesses the quality of text generation models, specifically in machine translation, summarization, paraphrasing, and conversational AI. Unlike older metrics such as BLEU, ROUGE, and METEOR, which rely on n-gram matching and need near-exact string overlap, TextEvaluator matches on meaning and contextual relevance, which aligns it more closely with human judgment. It typically uses transformer models such as BERT to extract the meaning of a generated output and determine whether it fulfills the intent of the reference text.
In a summarization task, for example, TextEvaluator can give two summaries the same score if they capture the same core message, even if they use different words. This is why TextEvaluator matters for modern NLP applications that involve greater creativity and variation in expression. Evaluators of this kind help determine whether generated text is contextually and semantically accurate, beyond surface-level overlap, reflecting how far natural language generation has come.
Summary
These are the fundamental techniques and models that power intelligent text understanding in NLP. Vectorization means transforming textual data into numeric form so that machines can process language efficiently. The foundational approaches include Bag-of-Words, n-grams, One-hot Encoding, TF-IDF, and Count Vectorization; each converts words or phrases into vectors based on their frequency or position. Then comes the more advanced stuff, such as Byte-Pair Encoding (BPE), which handles rare words by segmenting them into subwords. Traditional Word Embeddings such as Skip-Gram and CBOW learn semantic relationships from context, while Pre-Trained Embeddings such as Word2Vec and GloVe bring pre-learned knowledge to the table. Contextual models such as BERT and ELMo go further by adjusting each word’s vector to its surrounding words, and Sentence and Document Embeddings extend the idea to larger units of text. For interpretation and analysis, tools like t-SNE visualize these embeddings in reduced dimensions, while TextEvaluator assesses the quality of generated text, turning data insight into action.
Still looking for more? Then stay tuned for more bite-size blog posts from Numpty Neuron on NLP, AI, and language and cognition! Stay Curious! Stay Numpty!
📖 NLP Blog Series
The NLP Blog Series is a structured, multi-part journey into the fascinating world of Natural Language Processing (NLP) — the field of AI that enables computers to understand, interpret, and generate human language. From the basics of text processing and machine learning techniques to advanced applications like chatbots, sentiment analysis, and large language models, this series breaks down complex concepts into easy-to-follow, practical guides. Whether you are a beginner curious about how machines understand text, a student diving into AI, or a professional exploring NLP applications, this series is designed to give you a step-by-step foundation along with real-world insights.
Part 5 of 9