I love epic fantasy stories. All the characters, plot lines, world-building, and the little details that foreshadow big developments — I devour them across multiple readings. I also had some time to play with the Google Cloud Platform’s Natural Language API. So I decided to see what analysis I could do with that API on one of the most voluminous and detailed epic fantasies I’ve read: Robert Jordan’s The Wheel Of Time series.
I’m going to blog my efforts, in case something interesting or useful shakes out. I’m using this opportunity to teach myself Python as I go, since that language is popular in the A.I. work I’ve seen. So there will probably be some discoveries (and horrendous examples of code) along the way.
My rough plan has the following milestones:
- Get the Wheel Of Time books in plain text format, so the Google APIs can read them.
- See what the API’s sentiment analysis, entity analysis, entity-sentiment analysis, and syntactic analysis data looks like for The Wheel Of Time.
- Use sentiment analysis to graph the story’s emotional arcs, in total and maybe per character, and compare them to the “six main story arcs” discussed in this article in The Atlantic.
After that? Let’s see where the data can take me. I have some thoughts on creating a system that can answer questions about the story, and possibly expanding the training model to include labels and concepts, but I’ll focus on my first three milestones to begin with.
Part 1: Get The Wheel Of Time In Plain Text
Tor Books has the commendable policy of selling all their eBooks unlocked and DRM-free, and I already have the books on my Barnes & Noble Nook, so I started with the EPUB file format.
Unfortunately, the Nook app on Android devices hides your eBook files in a directory you can only access if you have root access to your device, and I wasn’t interested in the warranty implications of going down that route. But the family Windows 10 machine has a free Nook app that downloads your eBooks. After that, it was just a matter of searching the drive for *.epub files. I found them in the rather obscure directory:
Huzzah! Then I renamed the file as a .zip (an .epub is a .zip with a particular directory structure), and dug into the ZIP archive. Unzipped, the file looked like this:
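As it turns out, the rename isn’t strictly necessary: Python’s standard-library zipfile module will happily open an .epub directly, since the extension is just a label on an ordinary ZIP. A minimal sketch — the demo.epub here is a tiny throwaway stand-in I build on the spot, not one of the actual books:

```python
import zipfile

# Build a tiny stand-in "EPUB" so the example is runnable;
# a real .epub is already a ZIP with this kind of layout.
with zipfile.ZipFile("demo.epub", "w") as z:
    z.writestr("mimetype", "application/epub+zip")
    z.writestr("OEBPS/chapter01.html", "<p>Hello</p>")

# No renaming to .zip needed: zipfile doesn't care about the extension.
with zipfile.ZipFile("demo.epub") as z:
    names = z.namelist()

print(names)  # ['mimetype', 'OEBPS/chapter01.html']
```

Renaming to .zip is still handy if you want to poke around with a desktop archive tool, which is what I did.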
| +- mimetype
| +- META-INF
| +- OEBPS
|    +- Images
|    +- (one HTML file per chapter)
The files I’m interested in are in OEBPS\. Each chapter or section (basically, each table of contents entry) has an HTML file, conveniently named so it sorts by chapter number. The markup is clean and well-formed, and the style classes are intuitive. Cleaning it would be straightforward.
Now I had to learn some Python. I was familiar with the syntax, and I was an experienced Java programmer, so most of what I had to learn could be found on Stack Overflow. Unless I got fancy, it would be a one-use script, but there are 15 books in The Wheel Of Time (14 in the main sequence, plus the prequel, New Spring), so it wouldn’t hurt to take a stab at maintainability. I wanted to do proper Test-Driven Development, but I was getting impatient to see progress, so I just kept tweaking-and-running until I got it to work on real data.
I envisioned three components:
- Something that stripped out HTML markup from a character stream and left readable plain text.
- Something that created a plain text file from an HTML chapter file, using the first component.
- Something that took an EPUB file, unzipped it, iterated through the chapter files, used the previous component to make text file equivalents, and zipped them up for transfer to Google Cloud Platform, or wherever.
Python had a built-in html.parser.HTMLParser which did exactly what I needed for the first component. After that, it was all file I/O and some ZIP manipulation, all with standard packages. The `if __name__ == "__main__":` construction for an executable script seemed awkward, but otherwise I was impressed with how compact the code was.
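Putting the three components together, here is a rough sketch of what that conversion can look like. To be clear, the names (TagStripper, chapter_to_text, epub_to_text_zip) and the OEBPS/*.html filter are my own placeholders for illustration, not the code from my repository:

```python
import zipfile
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Component 1: collects character data and drops all markup."""
    def __init__(self):
        super().__init__()
        self._parts = []

    def handle_data(self, data):
        self._parts.append(data)

    def text(self):
        return "".join(self._parts)

def chapter_to_text(markup: str) -> str:
    """Component 2: convert one chapter's HTML into plain text."""
    stripper = TagStripper()
    stripper.feed(markup)
    return stripper.text()

def epub_to_text_zip(epub_path: str, out_path: str) -> None:
    """Component 3: read the chapter HTML files out of an EPUB
    (itself a ZIP) and write a ZIP of plain-text equivalents."""
    with zipfile.ZipFile(epub_path) as src, zipfile.ZipFile(out_path, "w") as dst:
        for name in src.namelist():
            # The chapter files live under OEBPS/; the exact filter
            # would depend on the real book's file names.
            if name.startswith("OEBPS/") and name.endswith(".html"):
                markup = src.read(name).decode("utf-8")
                txt_name = name.rsplit("/", 1)[-1][:-len(".html")] + ".txt"
                dst.writestr(txt_name, chapter_to_text(markup))
```

For example, `chapter_to_text('<p class="chapter">The Wheel of Time <i>turns</i>.</p>')` yields `'The Wheel of Time turns.'` — the markup goes, the prose stays.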
If you want to see my ugly-but-functional beginner Python code, feel free to peek on GitHub.
At this point, I had The Wheel Of Time all in a ZIP of plain text files. I was ready to figure out how to use the Google Cloud Platform’s Natural Language API. That will be in my next post.