IV. Data Preprocessing

In the data preprocessing stage, we transform the raw data set into a cleaner, more usable form. Most of the work in this part is done with the NLTK package.

First of all, we lowercase all the text because string matching is case sensitive; lowercasing keeps the tokens consistent throughout the NLP pipeline. Since a tweet may contain more than one sentence, we first tokenize each tweet into sentences. Then we tokenize the sentences into words, the most common unit of tokens. The tokenization step is helpful for understanding the texts and for building NLP models.
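
A minimal sketch of these two steps with NLTK might look like the following; the tweet string is a made-up example:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time model download; newer NLTK releases may also need 'punkt_tab'.
nltk.download('punkt')

tweet = "NLP is fun! We love tokenizing tweets."  # made-up example tweet

# Step 1: lowercase the raw text so later lookups are case-insensitive.
text = tweet.lower()

# Step 2: split the tweet into sentences, then each sentence into words.
sentences = sent_tokenize(text)
tokens = [word_tokenize(s) for s in sentences]
print(tokens)
# [['nlp', 'is', 'fun', '!'], ['we', 'love', 'tokenizing', 'tweets', '.']]
```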

After tokenization, we perform stemming, lemmatization, and stopword removal. This step is very important for removing noise from the data set. Stemming is the process of reducing a word to its stem by stripping its affixes (suffixes and prefixes). Lemmatization serves a similar purpose, and the two operations are closely related. However, stemming may return a string that is not an actual word, while lemmatization always returns an actual word with the same meaning. For example, the word "better" has "good" as its lemma, a mapping that stemming misses. Finally, stopword removal discards words that occur frequently in the texts but carry little useful meaning.
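
The contrast between the three operations can be seen in a short NLTK sketch; the token list is a hypothetical example. Note that the WordNet lemmatizer maps "better" to "good" only when it is told the word is an adjective, since its default part of speech is noun:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

tokens = ['the', 'movies', 'were', 'better', 'than', 'expected']  # hypothetical tokens

# Stemming just strips affixes and can produce non-words: 'movies' -> 'movi'.
stems = [stemmer.stem(t) for t in tokens]

# Lemmatization returns dictionary words; 'better' -> 'good' only when it is
# tagged as an adjective (pos='a'), because the default POS is noun.
print(lemmatizer.lemmatize('better', pos='a'))  # good
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

# Stopword removal drops frequent words that carry little meaning.
filtered = [t for t in lemmas if t not in stop_words]
print(filtered)  # ['movie', 'better', 'expected']
```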

In summary, the preprocessing pipeline consists of the following steps (a combined sketch of the whole pipeline follows this list):

1. Import necessary packages & read the data sets
2. Lowercasing
3. Sentence tokenization
4. Word tokenization
5. Stemming, lemmatization, and stopword removal
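
Putting the steps above together, a minimal end-to-end preprocessing function might look like the sketch below; the function name and example tweet are hypothetical, and we use lemmatization (rather than stemming) as the normalization step:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource)

LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))

def preprocess_tweet(tweet):
    """Lowercase, tokenize, lemmatize, and remove stopwords from one tweet."""
    clean_tokens = []
    for sentence in sent_tokenize(tweet.lower()):
        for token in word_tokenize(sentence):
            # Keep alphabetic, non-stopword tokens in lemmatized form.
            if token.isalpha() and token not in STOP_WORDS:
                clean_tokens.append(LEMMATIZER.lemmatize(token))
    return clean_tokens

print(preprocess_tweet("The movies were better than expected!"))
# ['movie', 'better', 'expected']
```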

After all the data preprocessing steps above, we obtain a list of comparatively clean word tokens.