TF-IDF Word Vectorization

(1) What Is TF-IDF Word Vectorization?

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used word vectorization technique in natural language processing (NLP).
It represents text data as numerical vectors, which can be used as input for various machine learning algorithms.

(2) Term Frequency

Term frequency (TF) measures how often a word appears in a document, relative to the length of that document.
It is calculated as the number of times the word appears in the document divided by the total number of words in that document.
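
As a rough illustration of this definition, the sketch below computes the term frequency of every word in a single text. The `tokenize` helper, its regex-based word splitting, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(text):
    # TF(word) = occurrences of the word / total words in the document.
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

tf = term_frequency("To be or not to be")
print(tf)
```

Here "to" and "be" each occur twice among six words, so both get a TF of 2 / 6.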

(3) Inverse Document Frequency

The inverse document frequency (IDF) measures how rare, and therefore how informative, a word is across all documents in the corpus.
It is calculated as the logarithm of the total number of documents divided by the number of documents containing the word.
A word that appears in every document gets an IDF of log(1) = 0.
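
The definition can be sketched in a few lines of Python. The base-10 logarithm is an assumption chosen to match the worked example later in the article (log(3 / 1) = 0.477); the function name and the small tokenized corpus are made up for illustration.

```python
import math

def inverse_document_frequency(tokenized_docs):
    # IDF(word) = log10(total documents / documents containing the word).
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for word in set(tokens):  # count each document at most once per word
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

docs = [["red", "apple"], ["red", "car"], ["green", "apple"], ["red", "sky"]]
idf = inverse_document_frequency(docs)
print(idf)
```

"red" occurs in three of the four documents and so receives a lower IDF than "car", which occurs in only one.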

(4) TF-IDF Calculation

The TF-IDF value for a word in a document is calculated by multiplying the term frequency (TF) by the inverse document frequency (IDF):

TF-IDF(word, document) = TF(word, document) * IDF(word)

(5) Steps For TF-IDF Calculation

Step 1: Create a vocabulary of all unique words in the corpus (collection of documents).
Step 2: For each document, calculate the TF-IDF value for each word in the vocabulary.
Step 3: Represent each document as a vector, where each element in the vector corresponds to the TF-IDF value of the corresponding word in the vocabulary.
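
The three steps can be sketched as one small function. This is a minimal illustration, not a production implementation: the regex tokenization, the base-10 logarithm, the function name `tfidf_vectors`, and the two-document corpus are all assumptions.

```python
import math
import re
from collections import Counter

def tfidf_vectors(corpus):
    # Assumes every document contains at least one word.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]

    # Step 1: vocabulary of unique words, in order of first appearance.
    vocabulary = list(dict.fromkeys(w for tokens in tokenized for w in tokens))

    # Document frequency: number of documents containing each word.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log10(len(corpus) / doc_freq[w]) for w in vocabulary}

    # Steps 2 and 3: one TF-IDF value per vocabulary word, per document.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] / len(tokens) * idf[w] for w in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = tfidf_vectors(["good movie", "bad movie"])
print(vocabulary)  # ['good', 'movie', 'bad']
print(vectors)
```

Every vector has the same length as the vocabulary, and "movie" scores 0 in both documents because it appears in all of them.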

(6) Example Of TF-IDF Calculation

Suppose we have the following three documents:

Document 1: “The quick brown fox jumps over the lazy dog.”
Document 2: “The dog barks at the cat.”
Document 3: “The cat meows softly.”

Step-1: Create The Vocabulary From The Corpus.

Vocabulary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "barks", "at", "cat", "meows", "softly"]
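
One way to build this vocabulary in code is sketched below. Lowercasing the text and stripping punctuation with a regex are assumptions about the preprocessing; the article does not state how the words were extracted.

```python
import re

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks at the cat.",
    "The cat meows softly.",
]

vocabulary = []
for doc in documents:
    # Assumed tokenization: lowercase, letters only.
    for word in re.findall(r"[a-z]+", doc.lower()):
        if word not in vocabulary:  # keep the first occurrence only
            vocabulary.append(word)

print(len(vocabulary))  # 13
print(vocabulary)
```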

Step-2: Calculate The Term Frequency (TF) For Each Word In Each Document.

Document-1: “The quick brown fox jumps over the lazy dog.”

“the” appears 2 times, so TF(“the”) = 2 / 9 = 0.222
“quick” appears 1 time, so TF(“quick”) = 1 / 9 = 0.111
“brown” appears 1 time, so TF(“brown”) = 1 / 9 = 0.111
“fox” appears 1 time, so TF(“fox”) = 1 / 9 = 0.111
“jumps” appears 1 time, so TF(“jumps”) = 1 / 9 = 0.111
“over” appears 1 time, so TF(“over”) = 1 / 9 = 0.111
“lazy” appears 1 time, so TF(“lazy”) = 1 / 9 = 0.111
“dog” appears 1 time, so TF(“dog”) = 1 / 9 = 0.111

Document-2: “The dog barks at the cat.”

“the” appears 2 times, so TF(“the”) = 2 / 6 = 0.333
“dog” appears 1 time, so TF(“dog”) = 1 / 6 = 0.167
“barks” appears 1 time, so TF(“barks”) = 1 / 6 = 0.167
“at” appears 1 time, so TF(“at”) = 1 / 6 = 0.167
“cat” appears 1 time, so TF(“cat”) = 1 / 6 = 0.167

Document-3: “The cat meows softly.”

“the” appears 1 time, so TF(“the”) = 1 / 4 = 0.25
“cat” appears 1 time, so TF(“cat”) = 1 / 4 = 0.25
“meows” appears 1 time, so TF(“meows”) = 1 / 4 = 0.25
“softly” appears 1 time, so TF(“softly”) = 1 / 4 = 0.25
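
The hand-computed TF values above can be checked with a short script. The tokenization (lowercase, letters only) and the variable names are assumptions made for this sketch.

```python
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks at the cat.",
    "The cat meows softly.",
]

tf_per_document = []
for doc in documents:
    tokens = re.findall(r"[a-z]+", doc.lower())  # assumed tokenization
    counts = Counter(tokens)
    # TF = count of the word / number of words in this document.
    tf_per_document.append(
        {word: round(count / len(tokens), 3) for word, count in counts.items()}
    )

for number, tf in enumerate(tf_per_document, start=1):
    print(f"Document-{number}: {tf}")
```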

Step-3: Calculate The Inverse Document Frequency (IDF) For Each Word.

The total number of documents is 3, and log denotes the base-10 logarithm.

IDF(“the”) = log(3 / 3) = 0
IDF(“quick”) = log(3 / 1) = 0.477
IDF(“brown”) = log(3 / 1) = 0.477
IDF(“fox”) = log(3 / 1) = 0.477
IDF(“jumps”) = log(3 / 1) = 0.477
IDF(“over”) = log(3 / 1) = 0.477
IDF(“lazy”) = log(3 / 1) = 0.477
IDF(“dog”) = log(3 / 2) = 0.176
IDF(“barks”) = log(3 / 1) = 0.477
IDF(“at”) = log(3 / 1) = 0.477
IDF(“cat”) = log(3 / 2) = 0.176
IDF(“meows”) = log(3 / 1) = 0.477
IDF(“softly”) = log(3 / 1) = 0.477

The IDF of “the” is log(3 / 3) = 0 no matter which document we look at, because IDF is computed once over the whole corpus, not separately for each document.
The same holds for every other word in the vocabulary.
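
The IDF values listed above can be reproduced with the following sketch. It assumes the base-10 logarithm (which is what gives log(3 / 1) = 0.477) and the same letters-only tokenization; the variable names are illustrative.

```python
import math
import re

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks at the cat.",
    "The cat meows softly.",
]

# One set of distinct words per document.
tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]

idf = {}
for word in sorted(set().union(*tokenized)):
    # Number of documents that contain the word.
    df = sum(word in tokens for tokens in tokenized)
    idf[word] = round(math.log10(len(documents) / df), 3)

print(idf)
```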

Step-4: Calculate the TF-IDF values for each word in each document.

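Multiply each word's TF by its IDF to get its TF-IDF value. A word that does not appear in a document has a TF of 0 there, so its TF-IDF is also 0 and it is omitted from the lists below.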

Document-1: “The quick brown fox jumps over the lazy dog.”

TF-IDF(“the”) = 0.222 * 0 = 0
TF-IDF(“quick”) = 0.111 * 0.477 = 0.053
TF-IDF(“brown”) = 0.111 * 0.477 = 0.053
TF-IDF(“fox”) = 0.111 * 0.477 = 0.053
TF-IDF(“jumps”) = 0.111 * 0.477 = 0.053
TF-IDF(“over”) = 0.111 * 0.477 = 0.053
TF-IDF(“lazy”) = 0.111 * 0.477 = 0.053
TF-IDF(“dog”) = 0.111 * 0.176 = 0.020

Document-2: “The dog barks at the cat.”

TF-IDF(“the”) = 0.333 * 0 = 0
TF-IDF(“dog”) = 0.167 * 0.176 = 0.029
TF-IDF(“barks”) = 0.167 * 0.477 = 0.080
TF-IDF(“at”) = 0.167 * 0.477 = 0.080
TF-IDF(“cat”) = 0.167 * 0.176 = 0.029

Document-3: “The cat meows softly.”

TF-IDF(“the”) = 0.25 * 0 = 0
TF-IDF(“cat”) = 0.25 * 0.176 = 0.044
TF-IDF(“meows”) = 0.25 * 0.477 = 0.119
TF-IDF(“softly”) = 0.25 * 0.477 = 0.119
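
The whole worked example can be recomputed in one pass, as sketched below under the same assumptions (letters-only tokenization, base-10 logarithm). Because the script multiplies unrounded TF and IDF values and rounds only at the end, a result can differ in the last digit from a product of the rounded figures.

```python
import math
import re
from collections import Counter

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks at the cat.",
    "The cat meows softly.",
]

tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
# Number of documents in which each word occurs.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

tfidf_per_document = []
for tokens in tokenized:
    counts = Counter(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(tokens)
        idf = math.log10(len(documents) / doc_freq[word])
        scores[word] = round(tf * idf, 3)
    tfidf_per_document.append(scores)

for number, scores in enumerate(tfidf_per_document, start=1):
    print(f"Document-{number}: {scores}")
```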

(7) Pros & Cons Of TF-IDF Technique.

Advantages:

Simplicity: TF-IDF is a relatively simple and easy-to-implement technique, making it a popular choice for many NLP tasks.
Interpretability: The TF-IDF values are easily interpretable, as they represent the importance of a word in a document based on its frequency and rarity in the corpus.
Effectiveness: TF-IDF has been shown to be effective in many NLP applications, such as text classification, information retrieval, and document summarization.
Sparsity: The TF-IDF vectors are typically sparse, meaning that most elements are zero, which makes them cheap to store and fast to process with machine learning algorithms that accept sparse input.
Flexibility: TF-IDF can be applied to a wide range of text-based tasks and can be combined with other techniques, such as feature selection or dimensionality reduction.

Disadvantages:

Lack of Semantic Understanding: TF-IDF only considers the frequency of words and does not capture the semantic or contextual relationships between them. This can limit the performance of TF-IDF in tasks that require a deeper understanding of language.
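Sensitivity to Document Length: TF-IDF can be sensitive to the length of the documents. When raw counts are used for TF, longer documents tend to have higher TF-IDF values for the same words; dividing by the document length, as in the definition above, reduces this effect.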
Inability to Capture Synonyms: TF-IDF treats each word as a unique feature and cannot capture the semantic similarity between words, such as synonyms.
Limited Handling of Rare Words: TF-IDF assigns high weights to rare words, including typos and noise tokens that occur in only one document, which can lead to overfitting and poor generalization on unseen data.
Neglect of Word Order: TF-IDF ignores the order of words in a document, which can be important for understanding the context and meaning of the text.
Lack of Context-Awareness: TF-IDF does not consider the context in which a word appears, which can be crucial for understanding the meaning and sentiment of the text.
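
Two of these limitations are easy to see in a small sketch. TF-IDF is computed from word counts, so texts with identical counts get identical vectors, and words with different spellings share no features. The helper name and sample phrases are made up for illustration.

```python
from collections import Counter

def word_counts(text):
    # Bag-of-words view: only which words occur and how often.
    return Counter(text.lower().split())

# Word order: both sentences reduce to the same counts.
order_lost = word_counts("dog bites man") == word_counts("man bites dog")

# Synonyms: the two phrases have no word in common.
shared = set(word_counts("big house")) & set(word_counts("large home"))

print(order_lost)  # True
print(shared)      # set()
```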

(8) Python Example

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# The three example documents, with an illustrative label column.
df = pd.DataFrame({'text':["The quick brown fox jumps over the lazy dog.",
                           "The dog barks at the cat.",
                           "The cat meows softly."],
                   'output':[1,1,0]})
df

tfidf = TfidfVectorizer()

Giving Data To TF-IDF Object:

# Learn the vocabulary and IDF values, then return the TF-IDF matrix as a dense array.
tfidf.fit_transform(df['text']).toarray()

Printing IDF Values For Each Unique Word:

# idf_ and the feature names are only available after the vectorizer has been fitted.
print(tfidf.idf_)
print(tfidf.get_feature_names_out())
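
The numbers produced by TfidfVectorizer will not match the hand calculation above. With its default settings, scikit-learn uses raw counts for TF, a smoothed natural-log IDF of ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below compares only the two IDF formulas in plain Python; the document frequencies are copied from the worked example and the variable names are illustrative.

```python
import math

n_docs = 3
# Number of documents containing each word in the worked example.
document_frequency = {"the": 3, "dog": 2, "quick": 1}

comparison = {}
for word, df in document_frequency.items():
    textbook_idf = math.log10(n_docs / df)                # formula used by hand
    smoothed_idf = math.log((1 + n_docs) / (1 + df)) + 1  # scikit-learn default
    comparison[word] = (round(textbook_idf, 3), round(smoothed_idf, 3))

for word, (textbook, smoothed) in comparison.items():
    print(f"{word}: textbook={textbook}, smoothed={smoothed}")
```

Under the smoothed formula a word that appears in every document, such as "the", keeps an IDF of 1 instead of 0, so it is down-weighted but not removed.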

Praudyog
