III. Data Visualization

In this step, we explore the data further through visualization. First, we investigate whether the locations attached to the tweets show any pattern, using several bar plots. Next, we analyze the tweets at the sentence, word, and character levels, and plot histograms of the results to give a direct impression. Finally, we plot two word clouds, one for each target label, so that we can compare the high-frequency words in disaster tweets and non-disaster tweets.

Explore the target column distribution

Although the location names in our raw data mix cities and countries, we can still tell that American users are more likely to include their location, whether or not the tweet is about a disaster.
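As a rough illustration of the counts behind such bar plots, the sketch below tallies the most frequent locations for each target label. It is a minimal example, not the original notebook code: the column names `location` and `target`, the helper `top_locations_by_target`, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter

def top_locations_by_target(rows, n=5):
    """Return {target: [(location, count), ...]} with the n most common locations."""
    counts = {}
    for row in rows:
        location = (row.get("location") or "").strip()
        if not location:
            continue  # skip tweets that carry no location
        counts.setdefault(row["target"], Counter())[location] += 1
    return {target: counter.most_common(n) for target, counter in counts.items()}

# Hypothetical toy rows, not taken from the real dataset.
rows = [
    {"location": "USA", "target": 1},
    {"location": "USA", "target": 0},
    {"location": "New York", "target": 1},
    {"location": "", "target": 0},
    {"location": "USA", "target": 1},
]
top = top_locations_by_target(rows)
```

Each list of (location, count) pairs can then be passed to any bar-plot routine, one plot per target label. Because the raw field mixes cities and countries, entries such as "USA" and "New York" are counted separately unless they are normalized first.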

Explore number of sentences in tweets encoded by target 0 or 1
Explore number of words in tweets encoded by target 0 or 1
Explore number of characters in tweets encoded by target 0 or 1

Based on our analysis and visualizations, the disaster tweets and the non-disaster tweets have similar distributions at the sentence, word, and character levels.
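The three per-tweet measures can be sketched as follows. This is a simplified illustration under stated assumptions: sentences are split naively on `.`, `!`, and `?` (a real tokenizer would handle abbreviations and URLs better), words are split on whitespace, and the names `text_stats` and `mean_stats_by_target` as well as the `text` and `target` keys are hypothetical.

```python
import re
from statistics import mean

def text_stats(text):
    """Count sentences, words, and characters in one tweet (naive splitting)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {"sentences": len(sentences), "words": len(words), "characters": len(text)}

def mean_stats_by_target(rows):
    """Average each measure separately for every target label."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["target"], []).append(text_stats(row["text"]))
    return {
        target: {key: mean(s[key] for s in stats)
                 for key in ("sentences", "words", "characters")}
        for target, stats in grouped.items()
    }

# Hypothetical toy rows, not taken from the real dataset.
rows = [
    {"text": "Fire downtown. Stay safe!", "target": 1},
    {"text": "What a lovely day", "target": 0},
]
example = text_stats(rows[0]["text"])
averages = mean_stats_by_target(rows)
```

The per-tweet values are what a histogram would be drawn from, one histogram per target label. The group averages give a quick numeric check of how close the two groups are.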

Word Clouds
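A word cloud is essentially a picture of word frequencies, so the comparison between the two groups can also be checked numerically. The sketch below counts words for one group of tweets. The tiny stop-word list, the helper `word_frequencies`, and the sample tweets are illustrative assumptions, not the original notebook code.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "in", "of", "to", "is", "and", "on", "for"}

def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lowercase word occurrences across texts, ignoring stop words."""
    counter = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counter[word] += 1
    return counter

# Hypothetical sample tweets, not taken from the real dataset.
disaster_tweets = ["Fire in the forest", "Forest fire spreading"]
freq = word_frequencies(disaster_tweets)
```

Running the same function on the non-disaster tweets and comparing the two `most_common` lists shows which high-frequency words the groups share. If the `wordcloud` package is used for plotting, a frequency mapping like this one can be supplied to `WordCloud.generate_from_frequencies`.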

Comments

By plotting the keywords, character lengths, and numbers of sentences, we can tell that there are few major differences between disaster and non-disaster tweets. This differs somewhat from our expectations, and it will no doubt make the task harder for our models.