Part V: Vectorization

What is Vectorization?

Vectorization converts tokens into a numeric vector so the text can be used in numerical analysis.

We've already completed the primary data processing, so the next step is to transform the text into a meaningful vector of numbers.

We did a very similar analysis during lecture, and now we can apply the same approach to our own data.

Recall the previous part, where we cleaned the text:

We can see above that the main part we want to analyze is the text column of our dataset, so we need to convert its tokens into vectors.

We briefly mentioned vectorization in lecture 7-1, and now we can try it ourselves. (Important: we used the cleaned data.)

Then we can use the .todense() method discussed during the lecture, since .todense() converts a sparse matrix into a dense matrix; a short sketch of this step is shown below.
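The following is a minimal sketch of this vectorization step using scikit-learn's CountVectorizer. The docs list is hypothetical stand-in data; in the actual project it would be replaced by the cleaned text column of our dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the cleaned "text" column of the dataset.
docs = [
    "data science is fun",
    "text data needs cleaning",
    "vectorization turns text into numbers",
]

# Build a bag-of-words (term-frequency) representation.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # sparse document-term matrix

# .todense() converts the sparse matrix to a dense matrix for inspection.
print(vectorizer.get_feature_names_out())
print(X.todense())
```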

We also discussed one-hot encoding during the lecture. By ignoring frequency altogether and recording only whether a word appears, we avoid emphasizing high-frequency words.

We can then apply the same one-hot encoding to the test dataset, as sketched below.
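A minimal sketch of this idea, assuming hypothetical train/test splits: passing binary=True to CountVectorizer records only the presence or absence of each word (the one-hot style encoding described above), and the vectorizer fitted on the training data is reused to transform the test data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train/test splits of the cleaned text column.
train_docs = ["the cat sat on the mat", "the dog sat"]
test_docs = ["the cat and the dog"]

# binary=True records only presence/absence, ignoring word frequency.
onehot = CountVectorizer(binary=True)
X_train = onehot.fit_transform(train_docs)  # learn vocabulary from training data
X_test = onehot.transform(test_docs)        # reuse that vocabulary on the test data

print(X_train.todense())
print(X_test.todense())
```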

(Copied from the lecture notes for week 7-1.) Term frequency-inverse document frequency (tf-idf) statistics put terms on approximately the same scale while also emphasizing relatively rare terms. There are several different tf-idf statistics.

The smoothed tf-idf, for a term $t$ and document $d$, is given by:

$$ \operatorname{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \log \left( \frac{N}{1 + n_t} \right) $$

where $N$ is the total number of documents and $n_t$ is the number of documents that contain $t$.

The sklearn.feature_extraction.text submodule of scikit-learn provides tools (such as TfidfVectorizer) for computing tf-idf:
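A minimal sketch using TfidfVectorizer on the same hypothetical documents as above; note that scikit-learn's default idf (with smooth_idf=True) differs slightly in form from the lecture formula, though the idea is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the cleaned text column.
docs = [
    "data science is fun",
    "text data needs cleaning",
    "vectorization turns text into numbers",
]

# Fit the tf-idf weights and transform the documents in one step.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X_tfidf.todense())
```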

To be more precise:

Term frequency:

$$ \operatorname{tf}(t, d) = \frac{\text{number of times term } t \text{ appears in } d}{\text{total number of terms in } d} $$

Inverse document frequency:

$$ \operatorname{idf}(t) = 1 + \log \left( \frac{N}{n_t} \right) $$

where $N$ is the number of documents and $n_t$ is the number of documents that contain $t$.

Measuring Similarity:

We can measure the similarity of two documents by computing the distance between their term-frequency vectors; one common choice is cosine similarity.

Cosine similarity often works well for language data. The cosine similarity between two vectors $a$ and $b$ is defined as:

$$ \frac{a \cdot b}{\Vert a \Vert \Vert b \Vert} $$

where $\Vert \cdot \Vert$ is the Euclidean norm.
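A minimal sketch, again on hypothetical documents: cosine_similarity from sklearn.metrics.pairwise compares the tf-idf vectors row by row, and transposing the document-term matrix gives term-to-term similarities instead of document-to-document ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the cleaned text column.
docs = [
    "data science is fun",
    "text data needs cleaning",
    "vectorization turns text into numbers",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Document-to-document cosine similarities.
print(cosine_similarity(X).round(3))

# Term-to-term cosine similarities (transpose the document-term matrix).
print(cosine_similarity(X.T).round(3))
```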

Calculating the cosine similarity shows that we can still find some similarity between different terms. Although the values are quite small, we can conclude that there may exist some similarities between different words in the text column of our original dataset.