The Data Rabbit Hole

An analysis of the text of "Alice's Adventures in Wonderland" by Lewis Carroll, using Python 3, NumPy, Pandas, Matplotlib, and NLTK (the Natural Language Toolkit).

This analysis includes data such as VADER sentiment analyses, average sentence lengths per chapter, chunking to extract unique named entities from the text, and various figures to visualize the data.

I made this project to practice data analysis, visualization, and natural language processing. I hope you like it!

-Terry Zhou
Natural Language Processing, or NLP, is a computer's ability to understand spoken and written language as humans do. It's a cornerstone of AI and has deep roots in data analysis and algorithms.
For this project, I'll be using the Natural Language Toolkit, or NLTK: a top platform for Python-based NLP. There are three submodules of the NLTK library that are the focus of this project:
Tokenization
VADER Sentiment Intensity Analysis
Chunking
But first, let's learn a little more about our text...
Alice's Adventures in Wonderland, or Alice in Wonderland as it is more commonly known, is an English children's novel published in 1865, written by Lewis Carroll with illustrations by John Tenniel.
In 1991 it was made available for free by Project Gutenberg. I will be using a slightly modified version of that text for this project, available for download below:
alice.txt

The following are all the code imports used in this project:
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Tokenization is one of NLTK's most basic functionalities. It involves breaking strings down into smaller substrings, or 'tokens'. Strings can be tokenized into sentences, words, or individual characters.
For instance, the following commands allow us to break the text down into sentences:
# read the full text, then split it into sentences
# (sentence tokenization requires nltk.download("punkt"))
with open("alice.txt", "r", encoding="utf-8") as f:
    alice_text = f.read()
alice_sentences = nltk.sent_tokenize(alice_text)
There are only 973 sentences in the text, split amongst 12 chapters. Most modern publication coaches advise an average sentence length of 14-20 words.
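A quick check on those numbers, using NLTK's word tokenizer (a minimal sketch; note that it counts punctuation marks as tokens, so the average runs slightly high):

sentence_lengths = [len(nltk.word_tokenize(s)) for s in alice_sentences]
print(len(alice_sentences))       # total sentence count
print(np.mean(sentence_lengths))  # average tokens per sentence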
As we can see from the data, Alice in Wonderland's sentences average in the 40-50 word range, a far cry from modern writing styles. The very first sentence makes the point clearly enough:

"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice, 'without pictures or conversations?'"
VADER (or Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based analytical tool built into NLTK. It allows us to analyze a given text for positive, negative, and neutral sentiment values, and score them accordingly. Like so:
sid = SentimentIntensityAnalyzer()
compound_scores = []
for sentence in alice_sentences:
    scores = sid.polarity_scores(sentence)  # one scoring pass per sentence
    compound_scores.append((sentence.replace("\n", " "),  # flatten line breaks
                            scores["compound"],
                            scores["pos"],
                            scores["neg"],
                            scores["neu"]))
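The tuples drop neatly into a pandas DataFrame for the analysis below (the column names here are my own):

scores_df = pd.DataFrame(compound_scores,
                         columns=["sentence", "compound", "pos", "neg", "neu"])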
The result is a value between -1 and 1, with anything between -0.5 and 0.5 considered emotionally neutral.
We can use this to easily determine the most emotionally negative and positive sentences in the text.
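Using the scores_df frame sketched above, that's one idxmin/idxmax lookup each:

most_negative = scores_df.loc[scores_df["compound"].idxmin()]
most_positive = scores_df.loc[scores_df["compound"].idxmax()]
print(most_negative["sentence"], most_negative["compound"])
print(most_positive["sentence"], most_positive["compound"])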
Most negative sentence: Compound Score: -0.9657
Most positive sentence: Compound Score: 0.9745
We can also apply our VADER analysis to chapters as a whole; as we can see from our Mean Compound Score graph, Alice in Wonderland starts off quite positively with a compound sentiment score of over 0.10.
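One way to compute those chapter-level scores (a sketch that assumes the chapter headings in alice.txt look like "CHAPTER I." and that any table of contents has been stripped):

import re

chapters = re.split(r"CHAPTER [IVX]+\.", alice_text)[1:]  # drop the front matter
chapter_means = []
for chapter in chapters:
    sentences = nltk.sent_tokenize(chapter)
    compounds = [sid.polarity_scores(s)["compound"] for s in sentences]
    chapter_means.append(np.mean(compounds))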
Notice the two troughs at Chapter V and Chapter IX. The trough in Chapter V makes sense: this is the part of the story where Alice is constantly shifting sizes and getting into all sorts of misadventures; she gets stuck in a house, is attacked by birds, is forced to recite poetry for an unreceptive audience, and so on.
But the trough in Chapter IX is interesting. The chapter is innocuous enough: it mostly involves a conversation between Alice, the Gryphon, and the Mock Turtle. There are two factors to consider here.
First, VADER analysis works from a lexicon of common words. In this case, it seems that VADER scores the 'Mock' in 'Mock Turtle' negatively, as shown here:
"You did," said the Mock Turtle.
{'neg': 0.359, 'neu': 0.641, 'pos': 0.0, 'compound': -0.4215}
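A quick check suggests the word alone drags the score down (the exact values depend on the lexicon version):

print(sid.polarity_scores("mock"))  # negative compound score for the bare word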
Secondly, the Mock Turtle spends most of the chapter sobbing. Crying obviously has a negative VADER score, but in the context of the chapter it is used for comedy.
This points to an interesting gap in VADER analysis: it is, as yet, a literal technique. It cannot capture situational irony, nuance, or satire.
Chunking is another basic NLP technique. It begins with tokenization, breaking strings down into smaller substrings, and part-of-speech tagging, labeling each token as a noun, verb, and so on. Related tokens are then grouped into phrases, or 'chunks'. That's chunking in a nutshell.
Once chunks have been created, we can parse through them to find Unique Named Entities: in other words, characters and places. Usually, we would do this hand-in-hand with a lemmatizer function to strip out common words that have been capitalized for literary effect.
However, one of the many quirks of Alice in Wonderland is that nearly all of the characters have common words as names. As such, a bit of manual entry is required to tell NLTK which capital words are simply words, and which are people.
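Here's a minimal sketch of that idea. It filters on the NNP (proper noun) tag rather than a full chunk grammar, and the stop set of capitalized common words is hypothetical; the real list would grow with manual review. Running it on the long White Rabbit sentence produces the tagged tokens and entity list shown below:

not_names = {"CHAPTER", "Oh", "Ah", "Well"}  # hypothetical manual stop set

def unique_entities(sentence):
    # POS-tag the sentence (requires nltk.download("averaged_perceptron_tagger")),
    # then keep unique proper nouns that aren't in the stop set
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    entities = []
    for word, tag in tagged:
        if tag == "NNP" and word not in not_names and word not in entities:
            entities.append(word)
    return tagged, entities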
`[('I', 'PRP'), ('shall', 'MD'), ('be', 'VB'), ('late', 'JJ'), ('!', '.'), ('”', 'NN'), ('(', '('), ('when', 'WRB'), ('she', 'PRP'), ('thought', 'VBD'), ('it', 'PRP'), ('over', 'IN'), ('afterwards', 'NNS'), (',', ','), ('it', 'PRP'), ('occurred', 'VBD'), ('to', 'TO'), ('her', 'PRP$'), ('that', 'IN'), ('she', 'PRP'), ('ought', 'MD'), ('to', 'TO'), ('have', 'VB'), ('wondered', 'VBN'), ('at', 'IN'), ('this', 'DT'), (',', ','), ('but', 'CC'), ('at', 'IN'), ('the', 'DT'), ('time', 'NN'), ('it', 'PRP'), ('all', 'DT'), ('seemed', 'VBD'), ('quite', 'RB'), ('natural', 'JJ'), (')', ')'), (';', ':'), ('but', 'CC'), ('when', 'WRB'), ('the', 'DT'), ('Rabbit', 'NNP'), ('actually', 'RB'), ('_took', 'VBZ'), ('a', 'DT'), ('watch', 'NN'), ('out', 'IN'), ('of', 'IN'), ('its', 'PRP$'), ('waistcoat-pocket_', 'NN'), (',', ','), ('and', 'CC'), ('looked', 'VBD'), ('at', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('then', 'RB'), ('hurried', 'VBD'), ('on', 'IN'), (',', ','), ('Alice', 'NNP'), ('started', 'VBD'), ('to', 'TO'), ('her', 'PRP$'), ('feet', 'NNS'), (',', ','), ('for', 'IN'), ('it', 'PRP'), ('flashed', 'VBD'), ('across', 'IN'), ('her', 'PRP$'), ('mind', 'NN'), ('that', 'IN'), ('she', 'PRP'), ('had', 'VBD'), ('never', 'RB'), ('before', 'RB'), ('seen', 'VBN'), ('a', 'DT'), ('rabbit', 'NN'), ('with', 'IN'), ('either', 'DT'), ('a', 'DT'), ('waistcoat-pocket', 'NN'), (',', ','), ('or', 'CC'), ('a', 'DT'), ('watch', 'NN'), ('to', 'TO'), ('take', 'VB'), ('out', 'IN'), ('of', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('burning', 'VBG'), ('with', 'IN'), ('curiosity', 'NN'), (',', ','), ('she', 'PRP'), ('ran', 'VBD'), ('across', 'IN'), ('the', 'DT'), ('field', 'NN'), ('after', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('fortunately', 'RB'), ('was', 'VBD'), ('just', 'RB'), ('in', 'IN'), ('time', 'NN'), ('to', 'TO'), ('see', 'VB'), ('it', 'PRP'), ('pop', 'VB'), ('down', 'RP'), ('a', 'DT'), ('large', 'JJ'), ('rabbit-hole', 'JJ'), ('under', 'IN'), ('the', 'DT'), ('hedge', 'NN'), ('.', '.')]`
`['Rabbit', 'Alice']`
We can use the chunking from the previous section to discover how often characters appear in each chapter.
Due to the episodic nature of Alice in Wonderland, you'll notice that most characters' occurrence data show up as only a few blocks of color.
These correspond to the one or two chapters they appear in. Alice, obviously, appears fairly evenly in every chapter.
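Here's a sketch of how such a chart might be assembled, reusing the chapters list from the VADER section. The character list is a hypothetical subset, and plain substring counts stand in for the chunked entities:

characters = ["Alice", "Rabbit", "Caterpillar", "Duchess", "Hatter", "Queen", "Gryphon"]
counts = pd.DataFrame(
    [[chapter.count(name) for chapter in chapters] for name in characters],
    index=characters,
    columns=[f"Ch. {i + 1}" for i in range(len(chapters))],
)
sns.heatmap(counts, cmap="viridis")  # one row per character, one column per chapter
plt.title("Character mentions per chapter")
plt.show()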
Natural Language Processing is a powerful tool that can quickly parse large amounts of text and glean a variety of insights from it.
However, as with any program or algorithm, it takes a human being to sift through the end data and decide what's truly relevant, and what's a fluke of the process.
I hope you enjoyed going down the Data Rabbit Hole with me. Until next time!