The Data Rabbit Hole

An analysis of the text of "Alice's Adventures in Wonderland" by Lewis Carroll, using Python 3, NumPy, Pandas, Matplotlib, and NLTK (the Natural Language Toolkit).

This analysis includes data such as VADER sentiment analyses, average sentence lengths per chapter, chunking to extract unique named entities from the text, and various figures to visualize the data.

I made this project to practice data analysis, visualization, and natural language processing. I hope you like it!

-Terry Zhou
Natural Language Processing, or NLP, is a computer's ability to understand spoken and written language as humans do. It's a cornerstone of AI and has deep roots in data analysis and algorithms.
For this project, I'll be using the Natural Language Toolkit, or NLTK: a top platform for Python-based NLP. There are three submodules of the NLTK library that are the focus of this project:
Tokenization
VADER Sentiment Intensity Analysis
Chunking
But first, let's learn a little more about our text...
Alice's Adventures in Wonderland, or Alice in Wonderland as it is more commonly known, is an English children's novel published in 1865, written by Lewis Carroll with illustrations by John Tenniel.
In 1991 it was made available for free by Project Gutenberg. I will be using a slightly modified version of that text for this project, available for download below:
alice.txt

The following are all the code imports used in this project:
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Tokenization is one of NLTK's most basic functionalities. It involves breaking strings down into smaller substrings, or 'tokens'. Strings can be tokenized into sentences, words, or individual characters.
For instance, the following commands allow us to break the text down into sentences:
# read the full text, then split it into sentences
# (sentence tokenization requires nltk.download("punkt"))
with open("alice.txt", "r", encoding="utf-8") as f:
    alice_text = f.read()
alice_sentences = nltk.sent_tokenize(alice_text)
There are only 973 sentences in the text, split amongst 12 chapters. Most modern publication coaches advise an average sentence length of 14-20 words.
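A quick check on those numbers, using NLTK's word tokenizer (a minimal sketch; note that it counts punctuation marks as tokens, so the average runs slightly high):

sentence_lengths = [len(nltk.word_tokenize(s)) for s in alice_sentences]
print(len(alice_sentences))       # total sentence count
print(np.mean(sentence_lengths))  # average tokens per sentence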
As we can see from the data, Alice in Wonderland's sentences average in the 40-50 word range, a far cry from modern writing styles. The very first sentence makes the point clearly enough:

"Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice, 'without pictures or conversations?'"
VADER (or Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based analytical tool built into NLTK. It allows us to analyze a given text for positive, negative, and neutral sentiment values, and score them accordingly. Like so:
sid = SentimentIntensityAnalyzer()
compound_scores = []
for sentence in alice_sentences:
    scores = sid.polarity_scores(sentence)  # one scoring pass per sentence
    compound_scores.append((sentence.replace("\n", " "),  # flatten line breaks
                            scores["compound"],
                            scores["pos"],
                            scores["neg"],
                            scores["neu"]))
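The tuples drop neatly into a pandas DataFrame for the analysis below (the column names here are my own):

scores_df = pd.DataFrame(compound_scores,
                         columns=["sentence", "compound", "pos", "neg", "neu"])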
The result is a value between -1 and 1, with anything between -0.5 and 0.5 considered emotionally neutral.
We can use this to easily determine the most emotionally negative and positive sentences in the text.
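Using the scores_df frame sketched above, that's one idxmin/idxmax lookup each:

most_negative = scores_df.loc[scores_df["compound"].idxmin()]
most_positive = scores_df.loc[scores_df["compound"].idxmax()]
print(most_negative["sentence"], most_negative["compound"])
print(most_positive["sentence"], most_positive["compound"])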
Most negative sentence: Compound Score: -0.9657
Most positive sentence: Compound Score: 0.9745
We can also apply our VADER analysis to chapters as a whole; as we can see from our Mean Compound Score graph, Alice in Wonderland starts off quite positively with a compound sentiment score of over 0.10.
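One way to compute those chapter-level scores (a sketch that assumes the chapter headings in alice.txt look like "CHAPTER I." and that any table of contents has been stripped):

import re

chapters = re.split(r"CHAPTER [IVX]+\.", alice_text)[1:]  # drop the front matter
chapter_means = []
for chapter in chapters:
    sentences = nltk.sent_tokenize(chapter)
    compounds = [sid.polarity_scores(s)["compound"] for s in sentences]
    chapter_means.append(np.mean(compounds))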
Notice the two troughs at Chapter V and Chapter IX. The trough in Chapter V makes sense: this is the part of the story where Alice is constantly shifting sizes and getting into all sorts of misadventures; she gets stuck in a house, is attacked by birds, is forced to recite poetry for an unreceptive audience, and so on.
But the trough in Chapter IX is interesting. The chapter is innocuous enough: it mostly involves a conversation between Alice, the Gryphon, and the Mock Turtle. There are two factors to consider here.
First, VADER analysis works from a lexicon of common words. In this case, it seems that VADER scores the 'Mock' in 'Mock Turtle' negatively, as shown here:
"You did," said the Mock Turtle.
{'neg': 0.359, 'neu': 0.641, 'pos': 0.0, 'compound': -0.4215}
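A quick check suggests the word alone drags the score down (the exact values depend on the lexicon version):

print(sid.polarity_scores("mock"))  # negative compound score for the bare word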
Secondly, the Mock Turtle spends most of the chapter sobbing. Crying obviously has a negative VADER score, but in the context of the chapter it is used for comedy.
This points to an interesting gap in VADER analysis: it is, as yet, a literal technique. It cannot capture situational irony, nuance, or satire.
Chunking is another basic NLP technique. It begins with tokenization, breaking strings down into smaller substrings, and part-of-speech tagging, labeling each token as a noun, verb, and so on. Related tokens are then grouped into phrases, or 'chunks'. That's chunking in a nutshell.
Once chunks have been created, we can parse through them to find Unique Named Entities: in other words, characters and places. Usually, we would do this hand-in-hand with a lemmatizer function to strip out common words that have been capitalized for literary effect.
However, one of the many quirks of Alice in Wonderland is that nearly all of the characters have common words as names. As such, a bit of manual entry is required to tell NLTK which capital words are simply words, and which are people.
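Here's a minimal sketch of that idea. It filters on the NNP (proper noun) tag rather than a full chunk grammar, and the stop set of capitalized common words is hypothetical; the real list would grow with manual review. Running it on the long White Rabbit sentence produces the tagged tokens and entity list shown below:

not_names = {"CHAPTER", "Oh", "Ah", "Well"}  # hypothetical manual stop set

def unique_entities(sentence):
    # POS-tag the sentence (requires nltk.download("averaged_perceptron_tagger")),
    # then keep unique proper nouns that aren't in the stop set
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    entities = []
    for word, tag in tagged:
        if tag == "NNP" and word not in not_names and word not in entities:
            entities.append(word)
    return tagged, entities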
`[('I', 'PRP'), ('shall', 'MD'), ('be', 'VB'), ('late', 'JJ'), ('!', '.'), ('”', 'NN'), ('(', '('), ('when', 'WRB'), ('she', 'PRP'), ('thought', 'VBD'), ('it', 'PRP'), ('over', 'IN'), ('afterwards', 'NNS'), (',', ','), ('it', 'PRP'), ('occurred', 'VBD'), ('to', 'TO'), ('her', 'PRP$'), ('that', 'IN'), ('she', 'PRP'), ('ought', 'MD'), ('to', 'TO'), ('have', 'VB'), ('wondered', 'VBN'), ('at', 'IN'), ('this', 'DT'), (',', ','), ('but', 'CC'), ('at', 'IN'), ('the', 'DT'), ('time', 'NN'), ('it', 'PRP'), ('all', 'DT'), ('seemed', 'VBD'), ('quite', 'RB'), ('natural', 'JJ'), (')', ')'), (';', ':'), ('but', 'CC'), ('when', 'WRB'), ('the', 'DT'), ('Rabbit', 'NNP'), ('actually', 'RB'), ('_took', 'VBZ'), ('a', 'DT'), ('watch', 'NN'), ('out', 'IN'), ('of', 'IN'), ('its', 'PRP$'), ('waistcoat-pocket_', 'NN'), (',', ','), ('and', 'CC'), ('looked', 'VBD'), ('at', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('then', 'RB'), ('hurried', 'VBD'), ('on', 'IN'), (',', ','), ('Alice', 'NNP'), ('started', 'VBD'), ('to', 'TO'), ('her', 'PRP$'), ('feet', 'NNS'), (',', ','), ('for', 'IN'), ('it', 'PRP'), ('flashed', 'VBD'), ('across', 'IN'), ('her', 'PRP$'), ('mind', 'NN'), ('that', 'IN'), ('she', 'PRP'), ('had', 'VBD'), ('never', 'RB'), ('before', 'RB'), ('seen', 'VBN'), ('a', 'DT'), ('rabbit', 'NN'), ('with', 'IN'), ('either', 'DT'), ('a', 'DT'), ('waistcoat-pocket', 'NN'), (',', ','), ('or', 'CC'), ('a', 'DT'), ('watch', 'NN'), ('to', 'TO'), ('take', 'VB'), ('out', 'IN'), ('of', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('burning', 'VBG'), ('with', 'IN'), ('curiosity', 'NN'), (',', ','), ('she', 'PRP'), ('ran', 'VBD'), ('across', 'IN'), ('the', 'DT'), ('field', 'NN'), ('after', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('fortunately', 'RB'), ('was', 'VBD'), ('just', 'RB'), ('in', 'IN'), ('time', 'NN'), ('to', 'TO'), ('see', 'VB'), ('it', 'PRP'), ('pop', 'VB'), ('down', 'RP'), ('a', 'DT'), ('large', 'JJ'), ('rabbit-hole', 'JJ'), ('under', 'IN'), ('the', 'DT'), ('hedge', 'NN'), ('.', '.')]`
`['Rabbit', 'Alice']`
We can use the chunking from the previous section to discover how often characters appear in each chapter.
Due to the episodic nature of Alice in Wonderland, you'll notice that most characters' occurrence data show up as only a few blocks of color.
These correspond to the one or two chapters they appear in. Alice, obviously, appears fairly evenly in every chapter.
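Here's a sketch of how such a chart might be assembled, reusing the chapters list from the VADER section. The character list is a hypothetical subset, and plain substring counts stand in for the chunked entities:

characters = ["Alice", "Rabbit", "Caterpillar", "Duchess", "Hatter", "Queen", "Gryphon"]
counts = pd.DataFrame(
    [[chapter.count(name) for chapter in chapters] for name in characters],
    index=characters,
    columns=[f"Ch. {i + 1}" for i in range(len(chapters))],
)
sns.heatmap(counts, cmap="viridis")  # one row per character, one column per chapter
plt.title("Character mentions per chapter")
plt.show()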
Natural Language Processing is a powerful tool that can quickly parse large amounts of text and glean a variety of insights from it.
However, as with any program or algorithm, it takes a human being to sift through the end data and decide what's truly relevant, and what's a fluke of the process.
I hope you enjoyed going down the Data Rabbit Hole with me. Until next time!