II. Data Description

In this second step, we describe each variable in the data sets, then check the type of each variable and look for missing values. This gives us the understanding of the data that we need for the in-depth analysis and preprocessing steps later in this project.

Import Necessary Packages and Read the Datasets
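
A minimal sketch of this step, assuming the data are stored as CSV files and loaded with pandas. The helper name `load_datasets` and the default file names `train.csv` and `test.csv` are illustrative assumptions, not taken from the original notebook.

```python
import pandas as pd


def load_datasets(train_path="train.csv", test_path="test.csv"):
    """Read the training and test CSV files into two DataFrames.

    The default paths are assumptions; point them at wherever
    the files actually live.
    """
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    return train, test
```

Calling `train, test = load_datasets()` would then give the two DataFrames used in the rest of this section.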

Variable description

Explore the size and composition of the data sets

Based on the output above, the training data set has 7613 observations and 5 variables: id, keyword, location, text, and target.

"Target" is the dependent variable, while the other 4 are independent variables. Of these 5 variables, keyword, location, and text are categorical or free-text variables; the rest (id and target) are numerical.

Also, we can see that there are 3263 observations in the test data set. It contains 4 variables.
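
The size and type checks described here can be sketched as follows. The small `sample` frame is a hypothetical stand-in with the same five columns as the training set, and `describe_composition` is an illustrative helper name; on the real data the same calls would be applied to the loaded training and test DataFrames.

```python
import pandas as pd

# Hypothetical miniature stand-in for the training set (same column names).
sample = pd.DataFrame({
    "id": [1, 4, 5],
    "keyword": [None, "ablaze", "ablaze"],
    "location": [None, "London", None],
    "text": ["first tweet", "second tweet", "third tweet"],
    "target": [1, 0, 1],
})


def describe_composition(df):
    """Return row/column counts and split columns by numeric vs. non-numeric type."""
    n_rows, n_cols = df.shape
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    other_cols = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "rows": n_rows,
        "columns": n_cols,
        "numeric": numeric_cols,
        "non_numeric": other_cols,
    }


composition = describe_composition(sample)
print(composition)
```

Splitting on `select_dtypes` is one way to confirm which variables are numerical and which are text or categorical; `df.info()` gives a similar overview in a single call.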

Explore the missing values in both training and testing data sets

Based on the summary above, it is easy to see that many observations in both the training and test data sets have missing values in the location column.
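
A missing-value summary of this kind could be produced with a sketch like the one below. The `sample` frame and the helper name `missing_summary` are hypothetical and only illustrate the technique; the values are not taken from the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for one of the data sets.
sample = pd.DataFrame({
    "id": [1, 4, 5, 6],
    "keyword": [None, "ablaze", "ablaze", "accident"],
    "location": [None, "London", None, "Paris"],
    "text": ["first", "second", "third", "fourth"],
})


def missing_summary(df):
    """Count missing values per column and express them as a share of all rows."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


report = missing_summary(sample)
print(report)
```

Reporting the percentage alongside the raw count makes the training and test sets comparable even though they differ in size.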