Five Years of Reading

Data Analysis

2024

Reading

Dyl

March 01, 2024

Introduction

I've been using this google sheet to keep track of every book that I've read since 2019. Now that I've completed the fifth year of my records, it feels like a decent time to do a recap and analysis of this book log. I'll be using pandas and seaborn to explore and visualize the data.

My goals for this post are to:

  1. Improve my data analysis, pandas and data visualization skills.
  2. Find and share interesting patterns and insights from my reading history.
  3. Leave you with the impression that I'm an intellectual with eclectic taste.

If that sounds boring and you don't want to read any further, the TLDR is:

  • I finished 268 books (200 books and 68 audiobooks) over five years.
  • I read a total of 76,546 pages, which is 41.94 pages per day for 1,825 days.
  • I listened to 968 hours of audio, which is 0.53 hours per day for 1,825 days.
  • The average book length was 382.73 pages, average audiobook length was 14.24 hours.
  • My average book rating was 3.89 (compared to an average Goodreads rating of 4.20).

How many times am I going to have to use the word "average" in this post? Place your bets now!

I'll dive much deeper into these stats and more below! But first let me provide some context and background about the data and how it was collected.

Why I Track My Reading

  • Back in 2016, I started doing the annual 52 Book Challenge, where the goal is to read 52 books each year. So, I needed a way to keep track of my reading progress for the challenge.
  • I have a preference for physical books, and when you read 52 a year for consecutive years you quickly run out of space for them in your home. As a result, I try to only keep books that I would like to re-read at some point. I've also found that keeping a record of which books I've read makes it easier for me to part with them.
  • When tracking book data, it was always in the back of my mind that I would analyze this data at some point and so, here we are!
  • I have a shit memory. Really shit. Like, I remember 1% of what I read. I hate the idea of not even being able to remember which books I've read, which would definitely begin to happen considering I give books away.
  • Satisfaction and motivation - if you've read Atomic Habits, you might remember the Habit Tracker concept. Basically, it feels really good each time you mark a book as completed.

How I Track My Reading

  • I didn't start out with an elaborate google sheet. Initially, I was tracking my reading challenge progress with pen and paper, but then moved to google sheets at some point.
  • I can't quite recall where the inspiration for my current tracking process came from, but the layout and the information I record in the google sheet have been largely unchanged since 2019, which means that I now have five full years of reading history to analyze.
  • I generally have to update the tracker twice for each book: once when I start it, and once when I finish it (to fill in the end date and my rating).
  • IMO, the most important attributes to track are: title, start date, end date, length and a rating. These are the attributes that are difficult to recall or recover later if you don't record them. Other attributes and metadata about the book, e.g. genre, author, author's gender etc, can be derived later solely from the book's title.
  • A common mistake made by Habit Tracker beginners is to be too ambitious in what you track: if the process of updating the tracker has too much friction, you'll find it hard to build a consistent habit of updating it. I definitely wouldn't recommend trying to track something as granular as the number of pages you read each day, for example.
  • The google sheet is convenient because I can update it easily from my phone as well as my computer. It also acts as the backend for the /reading page on this site, which uses the Google Sheets API to retrieve the data.
  • On the /reading page I link to the Goodreads url and cover image for each book. Personally I think that most of the Goodreads website is dogshit and I only really use Goodreads as a reference for book ratings, which I've found it fairly useful for.

Data Preparation

  1. I exported the data from my google sheet as .tsv files (one for each year).
  2. I wrote a quick Puppeteer script to scrape metadata and ratings data from Goodreads. For each book, I scraped the following information from Goodreads:
  {
    "scrapedTitle": "Designing Data-Intensive Applications",
    "rating": "4.71",
    "ratingStats": "8,095 ratings and 746 reviews",
    "publicationInfo": "First published April 25, 2015",
    "fiveStarRatings": "6,216 (76%)",
    "fourStarRatings": "1,530 (18%)",
    "threeStarRatings": "278 (3%)",
    "twoStarRatings": "46 (<1%)",
    "oneStarRatings": "25 (<1%)",
    "title": "Designing Data-Intensive Applications"
  }

Note: I scraped the ratings for both books and audiobooks from Goodreads. The Audible store would likely be a better source of ratings for the audiobooks I listened to. To prevent scope creep for this work, I won't be creating an Audible scraper right now.
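
The scraped fields all arrive as strings, so they need converting before they can be used in pandas. Here is a minimal sketch of that clean-up step, using only the example record shown above. The helper names (`to_int`, `parse_record`) and the output keys are made up for illustration and are not taken from the actual notebook, and the regexes assume every record follows the same text patterns as this one.

```python
import re

# One scraped record, copied from the example above (subset of fields).
scraped = {
    "rating": "4.71",
    "ratingStats": "8,095 ratings and 746 reviews",
    "publicationInfo": "First published April 25, 2015",
    "fiveStarRatings": "6,216 (76%)",
}


def to_int(text):
    """Turn a string like '8,095' into the integer 8095."""
    return int(text.replace(",", ""))


def parse_record(raw):
    """Convert scraped strings into numeric fields (hypothetical helper)."""
    # Assumes ratingStats contains exactly two numbers: ratings, then reviews.
    ratings, reviews = re.findall(r"[\d,]+", raw["ratingStats"])
    # Assumes publicationInfo ends with a four-digit year.
    year = int(re.search(r"(\d{4})\s*$", raw["publicationInfo"]).group(1))
    # The star fields look like "6,216 (76%)"; keep only the count.
    five_star = to_int(raw["fiveStarRatings"].split(" ")[0])
    return {
        "goodreads_rating": float(raw["rating"]),
        "num_ratings": to_int(ratings),
        "num_reviews": to_int(reviews),
        "year_published": year,
        "five_star_count": five_star,
    }


parsed = parse_record(scraped)
print(parsed)
```

Records with a different shape (for example a publication line with no year at the end) would raise an error here, which is probably what you want when cleaning a small hand-curated dataset.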

  3. I used ChatGPT to retrieve the gender details of each author. This worked really well, and it actually helped me to identify some name spelling errors I had made. It looked like this: chatgpt_genders

  4. I combined all of the data from these three sources into a single pandas dataframe: final_dataframe

I only have enough width here to show you a subset of the columns: subset-columns

You can look at the full dataset in this publicly available google sheet or even look at the Jupyter notebook on GitHub that I used to load/transform the data in this final_df dataframe.

Summary Statistics

Let's get started by looking at some summary statistics and distributions:

  • I finished 268 books (200 books and 68 audiobooks) over the five years.
  • The yearly totals were: 52, 54, 57, 49 and 56 (from 2019 to 2023, respectively).
  • I read a total of 76,546 pages, which is 41.94 pages per day for 1,825 days.
  • I listened to 968 hours of audio, which is 0.53 hours per day for 1,825 days.
  • I had a strong preference for Non-Fiction (63.4%) over Fiction (36.6%).
  • I read 152 physical books (56.72%) vs. 48 Kindle books (17.91%) vs 68 audiobooks (25.37%).
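
The per-day averages and format shares above are simple ratios of the quoted totals. As a quick worked check, this sketch uses only the numbers from the list; the variable names are made up.

```python
# Totals quoted in the list above.
total_pages = 76_546
total_audio_hours = 968
physical, kindle, audiobooks = 152, 48, 68

days = 5 * 365  # 1,825 days, as used in the post (ignores the 2020 leap day)
books_with_pages = physical + kindle         # 200 books
all_titles = books_with_pages + audiobooks   # 268 titles

pages_per_day = total_pages / days
hours_per_day = total_audio_hours / days
avg_pages = total_pages / books_with_pages
avg_hours = total_audio_hours / audiobooks

print(f"{pages_per_day:.2f} pages per day")        # 41.94
print(f"{hours_per_day:.2f} audio hours per day")  # 0.53
print(f"{avg_pages:.2f} pages per book")           # 382.73
print(f"{avg_hours:.2f} hours per audiobook")      # 14.24

# Share of all 268 titles by format.
for name, count in [("Physical", physical), ("Kindle", kindle), ("Audiobook", audiobooks)]:
    print(f"{name}: {count / all_titles:.2%}")
```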

FYI, I will for the most part use the verb "read" and will not explicitly refer to books/audiobooks separately. I will only distinguish between reading and listening when necessary (e.g. when looking at format specific statistics like number of pages or audiobook hours).

Ratings

The ratings I assign to books represent my subjective experience of the book. That is to say, I don’t try to take an objective viewpoint (e.g. I don't consider the historical/cultural significance of each book).

Instead, I like to embrace the inherent subjectivity of book ratings and rate books based on how much I enjoyed reading them. I use a 1 to 5 star rating system (with half stars) to do this.

  • My average book rating was 3.89 (compared to the average Goodreads score of 4.20), which means I generally rate books 7.3% lower than the Goodreads score.
  • If we look at the distribution of my ratings, it appears to be fairly Gaussian:

ratings-dist

  • Observe that the distribution is centered around the most common score (4 stars), which I gave a total of 98 times (36.6%).
  • The lowest score I gave any book was 2.5 stars, which I gave to only 8 books (3%).
  • I gave out the highest possible score (5 stars) a total of 23 times (8.5%).

Genre

The genres used in this analysis were self-labeled by me, from a list of 15 possible values. Genre classification can be messy... e.g. is 1984 Science Fiction or is it Classics? I don't stress too much about getting this classification perfect, so there are bound to be some inaccuracies. I have a catch-all "Miscellaneous" genre, which holds, for example, a few sports-related books that I read.

Let's have a look at the breakdown of books read by genre:

genre-dist

  • My most read genres were: History (14.18%), Science Fiction (11.94%) and Fantasy (10.82%).
  • My least read genre was Technical (with 5 books read), which covers things like Computer Science or Machine Learning textbooks.
  • My selection was fairly well distributed (having read 10+ books in nearly all of the genres).

How did my average rating change depending on the genre of the books?

average-rating-by-genre

  • I'm very surprised to see Philosophy (3.46) as the genre with the lowest average rating.
  • I was also not expecting True Crime (4.50) to be my highest rated genre! It's worth noting that the sample size for True Crime is quite small, at only 7 books.

High Bar for Selection

Goodreads is my primary source for rating information when I'm trying to suss out if a given book is worth reading.

  • It turns out that I have a very high ratings bar for selecting which books I read:

goodreads-ratings-dist

  • The above chart shows (binned) Goodreads scores for all the books I read, where each bar represents a 0.25 width rating bin.
  • 85% of the books I read had a Goodreads score of 4.0 or above.
  • Only 6.35% of the books I read had a Goodreads score that was under 3.90!
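
For anyone curious how the 0.25-wide bins can be built, here is a minimal pandas sketch. The scores below are made up for illustration (they are not the real dataset), and starting the bins at 3.5 is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up Goodreads scores for illustration; not the real dataset.
goodreads = pd.Series([3.73, 3.85, 3.95, 4.05, 4.20, 4.39, 4.71])

# Edges 3.50, 3.75, ..., 5.00 give six bins that are each 0.25 wide.
edges = np.arange(3.5, 5.01, 0.25)

# right=False makes each bin include its lower edge, e.g. [4.0, 4.25).
binned = pd.cut(goodreads, bins=edges, right=False)

# sort=False keeps the bins in rating order instead of sorting by count.
counts = binned.value_counts(sort=False)

# Fraction of books at or above a 4.0 Goodreads score.
share_4_plus = (goodreads >= 4.0).mean()

print(counts)
print(f"{share_4_plus:.1%} of these scores are 4.0 or above")
```

Using left-closed bins means a score of exactly 4.0 lands in the 4.0 to 4.25 bin, which matches how a statement like "4.0 or above" is usually read.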

Note: The book ratings that I scraped from Goodreads represent a single point in time, but we should expect ratings to change over the lifetime of a book:

  • Recently published books, especially, can drift away from their current rating in the years after their release.
  • Consider for example, when a popular fiction author releases a new book. It’s most likely to be read early on by fans of that author, who will have a favorable bias towards the author's writing style.
  • I don’t think this will have any meaningful effect on the analysis here, but I thought it was an interesting point to mention.

Gender

  • A whopping 92.5% of the books I read were written by male authors, compared to only 7.5% books by female authors.
  • My average rating for female authored books was 3.70, noticeably lower than my 3.91 average rating for male authored books.

How should we interpret this? Can we draw the conclusion that I have a very strong preference towards male authors? Do I have subconscious or even conscious bias against female authors?

Not necessarily! Take for example my favorite genres, i.e. History, Science Fiction and Fantasy. If the majority of authors publishing books in these genres are male then we would expect that I should read more books written by men. I did a very quick google search to find some data here:

  • 76% of History books published in the US (in 2015) were written by men (source).
  • 78% of Science Fiction submissions to a major publisher were from men (source).

I'd like to note that these are statistics for recent years. I would expect that the further back in time you go, these genres will be even more heavily dominated by male authors (due to unfortunate historical biases around the role of women).

We will see (in a section to follow) that 30% of the books I read were published pre-21st Century. I would posit that the gender distribution of authors will be even more heavily skewed towards men in these genre and period combinations, which might help to further explain the notionally high 92.5% male authors figure.

I wanted to consider the Fantasy genre separately, as it's not as heavily dominated by men:

  • I've seen authorship ratios in this genre described as roughly 66.66% male and 33.33% female.
  • Unlike History and Science Fiction, there are lots of popular female dominated fantasy sub-genres (of which I'm not the target audience).
  • I think the fantasy sub-genres I like, e.g. High Fantasy or Epic Fantasy, are more heavily male dominated.

I reckon it's fair to say male authors tend to mostly write male protagonists:

  • It's probably easier to put yourself in the POV of characters of your own gender? It sounds plausible to me that this might have a measurable impact on your subjective enjoyment of a book?
  • I really enjoyed Mark Lawrence's Book of the Ancestor and Brandon Sanderson's Mistborn series, both of which have great female protagonists.
  • On the other hand I absolutely hated Samantha Shannon's The Priory of the Orange Tree.
  • Does this mean that I can only enjoy female protagonists if they're written by male authors? I don't think I have enough data to draw conclusions. All I know for certain is that The Priory of the Orange Tree is a really shit book.

I hope that hasn't come across in a way that makes it feel like I need to excuse myself for mostly reading male-authored books. I just read books that I think I will enjoy, and that's good enough for me.

There are excellent authors of both genders in every genre. If I do end up reading more Robin Hobb or Ursula K Le Guin this year it'll be because they are exceptional writers, not because they happen to be women.

I just took a quick look at the 2023 Goodreads Best Fantasy Award, and I can see that 15 out of the 20 nominees were female authors. So I think they're doing just fine without my readership :)

Book Length

  • The average book length was 382.73 pages.
  • The shortest book I read was Seneca's On the Shortness of Life (31 pages).
  • The longest book I read was Stephen King's The Stand (1,206 pages).
  • Unsurprisingly, the majority of books I read fell in the range of 200-500 pages long:

pages-dist

I also wanted to see if the length had an impact on Goodreads scores. We can use a scatterplot for this:

rating-vs-page-length

We can quickly see that:

  • All of the longest books were Fiction (rather than Non-Fiction).
  • Nearly all of the 600+ page books I read had a score of 4.20 or above.
  • This is quite interesting - I wonder is it because I have a selection bias to only choose such long books if they are known to be exceptional?
  • It could also just be an example of some sort of sunk cost or loss aversion bias, whereby once readers have invested so much time and effort into reading a huge tome, they can't admit that it might have been below average or a waste of time?

We can also look at the distribution of audiobook durations in hours:

hours-dist

  • The average audiobook length was 14.24 hours.
  • The shortest audiobook I listened to was Rob Fitzpatrick's The Mom Test (4 hours).
  • The longest audiobook I listened to was Ian Kershaw's Hitler (44 hours).
  • Most of the audiobooks I read fall in the 5-20 hour range.
  • Over time I've found that 8-15 hours is the sweet spot for audiobook length.
  • I find it fairly difficult to stay engaged with an audiobook once it goes over the 20 hour mark.

Period Published

I used the year of publication to bucket each book into one of six periods. These periods are:

  • Antiquity (up to 499 CE)
  • Post Classical (500 CE to 1499 CE)
  • Early Modern (1500 CE to 1799 CE)
  • Late Modern (1800 CE to 1944 CE)
  • Contemporary 20th (1945 CE to 1999 CE)
  • Contemporary 21st (2000 CE onwards)
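
A minimal sketch of this bucketing with pandas is below. It assumes each boundary year belongs to the period whose range lists it (so 499 is Antiquity and 500 is Post Classical) and that 2000 counts as Contemporary 21st. The sample years are illustrative only, and BCE years would need to be stored as negative numbers for this to work.

```python
import numpy as np
import pandas as pd

labels = [
    "Antiquity",
    "Post Classical",
    "Early Modern",
    "Late Modern",
    "Contemporary 20th",
    "Contemporary 21st",
]

# Each interior edge is the last year of a period. pd.cut closes bins on the
# right by default, so (499, 1499] covers 500 CE to 1499 CE, and so on.
edges = [-np.inf, 499, 1499, 1799, 1944, 1999, np.inf]

# Illustrative publication years, chosen to sit on or near the boundaries.
years = pd.Series([49, 500, 1890, 1945, 1999, 2000, 2023])

periods = pd.cut(years, bins=edges, labels=labels)
print(pd.DataFrame({"year_published": years, "period": periods}))
```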

Let's use these periods to plot a distribution of when the books I read were published:

period-dist

  • 69.78% of the books I read were written in the 21st Century - a lot higher than I expected.
  • Only 11.57% of the books I read were written before the end of WW2 (1945).

Book Format

This chart shows the (cumulative) count of books I completed over time for each format (Physical vs Kindle vs Audiobook): books-by-format-trend

Physical vs Kindle:

  • There was a big change to my Kindle vs Physical usage from February 2023 - which is when I injured my neck snowboarding. I now have to really avoid sitting and/or looking down for long periods.
  • Since then, I have had to do most of my reading on my Kindle, which I can hold with one hand while lying on my back or side in bed. I can't wait to get back to physical books though, here's hoping my neck injury eventually improves :)

Audiobooks:

  • My audiobook consumption rate has been mostly constant since I started tracking audiobooks in mid-2019.
  • My audiobook numbers have ticked up slightly since September 2023, which is when Spotify introduced free audiobooks (as part of my existing paid membership).
  • I now get 16 free audiobook hours per month on Spotify (vs. a one book per month model on Audible).
  • I like these two different pricing models, because I can use an Audible credit for really long audiobooks and Spotify hours for short books.

Monthly Reading Numbers

Let's use a heatmap to visualize the number of pages I read in each of the 60 months: monthly-pages-heatmap

  • My average number of monthly pages read across the five years was 1,276 pages.
  • My best month was October 2023 (2,394 pages) - which is when I reread Red Rising books 1-5 and the sixth book Lightbringer.
  • My worst month was December 2023 (400 pages) - I spent most of this month travelling (Singapore, Ireland, London, Japan) and I had already hit my reading goal for the year by the end of November.
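
A year-by-month grid like this can be built with a pivot table. The sketch below uses made-up rows and assumed column names (`end_date`, `pages`), and it credits all of a book's pages to the month the book was finished, which is an assumption rather than something the post confirms.

```python
import pandas as pd

# Made-up rows; "end_date" and "pages" are assumed column names.
books = pd.DataFrame({
    "end_date": pd.to_datetime(
        ["2019-01-15", "2019-01-28", "2019-03-02", "2020-01-10"]
    ),
    "pages": [300, 250, 400, 500],
})

# Split the finish date into the two heatmap axes.
books["year"] = books["end_date"].dt.year
books["month"] = books["end_date"].dt.month

# Rows are years, columns are months, cells are total pages finished.
grid = books.pivot_table(
    index="year", columns="month", values="pages", aggfunc="sum", fill_value=0
)

# Make sure all 12 months appear, even those with no finished books.
grid = grid.reindex(columns=range(1, 13), fill_value=0)
print(grid)
```

A grid in this shape can be passed straight to `seaborn.heatmap(grid, annot=True)` to draw the chart.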

Let's look at the same heatmap visualization, but this time we are plotting the number of hours of audiobook listening for each month: monthly-hours-heatmap

  • The average number of audiobook hours I listened to per month across the five years was 16.13 hours.
  • My best audiobook months were September 2023 (46.0 hours) and October 2023 (41.78 hours). I was spending a lot of time walking outside at the time, which has become my primary mode of exercise since the aforementioned neck injury :(
  • There's not really a standout lowest month for audiobook listening. I listen to a lot of podcasts as well as audiobooks, and so my audiobook listening hours fluctuate month to month depending on my podcast episode backlog.
  • I did not start listening to audiobooks until June 2019, which is why we have no data for the first six months of 2019.

Rereads

  • Out of the 268 books I completed, 12 were rereads (4.48%), roughly equivalent to 2 or 3 books per year.
  • This number (12) was actually a lot lower than I was expecting it to be. I would have guessed I reread 5 books per year.
  • My average rating across all these rereads was 4.35 (compared to my overall average rating of 3.89).
  • Five of these rereads were books 1-5 of the Red Rising saga.
  • This also included multiple rereads of 1984 and The Picture of Dorian Gray (which is my favorite book).

Book Club Reads

Anecdotally I have found that I seem to enjoy books less when I read them as part of a book club. Why is this?

I only read four books as part of various book clubs:

  • Megha Majumdar's A Burning - My rating was: 3.0 vs the Goodreads score: 3.73.
  • Delia Owens' Where The Crawdads Sing - My rating was: 4.0 vs the Goodreads score: 4.39.
  • J.P. Delaney's Believe Me - My rating was: 2.5 vs the Goodreads score: 3.70.
  • Kazuo Ishiguro's Never Let Me Go - My rating was: 3.0 vs the Goodreads score: 3.85.

Although four books might not be enough to draw any statistically significant conclusions, I still want to see if the data backs up this feeling I have.

  • The average Goodreads score for these four book club reads was 3.92, which is less than the overall average Goodreads score (4.20) of all books I read. So we could say that the book club reads were objectively worse than my usual fare.
  • My average rating (3.12) for these four books is well below my overall average rating of 3.89.
  • This means that for book club reads my average rating delta (vs. Goodreads) was -0.79, compared to my overall average rating delta of -0.31. So I have a tendency to rate these books even lower vs Goodreads than I usually would.
  • I guess this is not really surprising, as with a book club I have only a very marginal input into what books are selected and I'm more likely to be reading genres/topics that are outside of my usual interest areas.
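
The deltas above can be reproduced from the four books listed. This is just a worked check using the ratings quoted in the post, with the overall averages (3.89 and 4.20) taken as given.

```python
from statistics import mean

# (title, my rating, Goodreads score) for the four book club reads above.
book_club = [
    ("A Burning", 3.0, 3.73),
    ("Where The Crawdads Sing", 4.0, 4.39),
    ("Believe Me", 2.5, 3.70),
    ("Never Let Me Go", 3.0, 3.85),
]

my_avg = mean(mine for _, mine, _ in book_club)
goodreads_avg = mean(score for _, _, score in book_club)

# Negative delta means I rated the books lower than Goodreads did.
club_delta = my_avg - goodreads_avg
overall_delta = 3.89 - 4.20  # overall averages quoted earlier in the post

print(f"My average: {my_avg:.3f}")
print(f"Goodreads average: {goodreads_avg:.4f}")
print(f"Book club delta: {club_delta:.4f}")
print(f"Overall delta: {overall_delta:.2f}")
```

The unrounded book club delta sits at roughly -0.79, more than twice the size of the overall gap of -0.31.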

I feel like I enjoy book club reads less because I'm too focused on analyzing the book for things to talk about, instead of just enjoying the book for its own sake.

Unique Authors

I read books by 216 unique authors, out of the total 268 books.

There were 187 authors who I only read a single book from.

I read two books from 17 authors.

There were 12 authors that I read three or more books from:

  1. Pierce Brown - 8 books
  2. George Orwell - 5 books
  3. Douglas Adams - 5 books
  4. Mark Lawrence - 4 books
  5. Walter Isaacson - 4 books
  6. Brandon Sanderson - 3 books
  7. Fyodor Dostoevsky - 3 books
  8. Joe Abercrombie - 3 books
  9. Stephen Fry - 3 books
  10. James S. A. Corey - 3 books
  11. Oscar Wilde - 3 books
  12. Brian McClellan - 3 books
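
As a sanity check, the author counts above should add back up to 216 unique authors and 268 books. A quick sketch using only the figures from this section:

```python
# Authors with three or more books, as listed above.
three_plus = {
    "Pierce Brown": 8,
    "George Orwell": 5,
    "Douglas Adams": 5,
    "Mark Lawrence": 4,
    "Walter Isaacson": 4,
    "Brandon Sanderson": 3,
    "Fyodor Dostoevsky": 3,
    "Joe Abercrombie": 3,
    "Stephen Fry": 3,
    "James S. A. Corey": 3,
    "Oscar Wilde": 3,
    "Brian McClellan": 3,
}

single_book_authors = 187
two_book_authors = 17

unique_authors = single_book_authors + two_book_authors + len(three_plus)
total_books = (
    single_book_authors * 1
    + two_book_authors * 2
    + sum(three_plus.values())
)

print(unique_authors)  # 216
print(total_books)     # 268
```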

Are Older Books Better?

I was recently listening to Patrick Collison's 2018 appearance on the Tim Ferriss Show. When asked by Tim how he decides which books are worth reading, Patrick said that a very simple heuristic is to read books that are at least 10 years old.

The idea being: the cream will rise to the top. The books that are still recommended after 10 years are likely to be better ones. Patrick is a prolific reader (see his bookshelf) - so I was really curious to check this heuristic against my reading data.

I split the books I read into two groups:

  • Group 1: books that were 10+ years old at the time of reading.
  • Group 2: books that were less than 10 years old at the time of reading.
  • A total of 127 (47.39%) books were 10+ years old vs 141 books (52.61%) that were less than 10 years old.
  • My average rating for the 10+ year old books was 3.84 vs 3.94 for the more recent books.
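
Here is a rough sketch of how a split like this could be done in pandas. The rows are made up, the column names are assumptions, and age is measured in whole years (year read minus year published), which may differ from how the real analysis handled exact dates.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
books = pd.DataFrame({
    "year_read": [2019, 2020, 2021, 2023, 2023],
    "year_published": [1949, 2018, 2005, 2014, 2023],
    "my_rating": [5.0, 4.0, 3.5, 4.5, 3.0],
})

# Age of the book (in whole years) at the time it was read.
books["age_at_reading"] = books["year_read"] - books["year_published"]

# Label each book by which side of the 10-year threshold it falls on.
books["group"] = books["age_at_reading"].ge(10).map(
    {True: "10+ years old", False: "under 10 years"}
)

# Count and average rating per group.
summary = books.groupby("group")["my_rating"].agg(["count", "mean"])
print(summary)
```

Changing the threshold from 10 to 2 gives the "read on release" split used further down.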

Patrick's heuristic sounds reasonable but I wasn't able to verify it based on my admittedly subjective data.

Let's look at a different timeframe:

  • We will define "reading on release" as reading a book within the first two years of its release.
  • I read 44 books (16.42%) on release vs. 224 (83.58%) books that were older than 2 years.
  • My average rating for the "read on release" books was 3.94, compared to a slightly lower 3.88 for the older books.

Again, my data does not suggest that I will prefer books because they are older:

  • I have shown earlier that I have a pretty high selection bar (85% of the books I read have a 4.0 or higher on Goodreads).
  • This might mean that Patrick's heuristic isn't as important for me, because I'm already filtering books through an objective ratings-based heuristic.

Books I Did Not Finish

There are certain types of books that are difficult to read depending on what else is going on in my life; sometimes I have started books at the wrong time and ended up making a conscious decision to not finish them.

In the five years our data is for, this happened with only a few books:

These books are all excellent and I intend to revisit each of them at some point.

You might think that five is a low number here. A side-effect of doing a reading challenge (e.g. with a goal of completing 52 books per year), is that it really discourages you from giving up on a book after you've sunk a meaningful amount of time into it. I think you could choose to interpret this as a good thing or a bad thing.

Closing Thoughts

One thing that I wanted to acknowledge here was how useful ChatGPT was when I was first learning the pandas/matplotlib APIs. This is the first time I've found an LLM to be so useful on a programming task. It was able to consistently provide me with great answers to all questions of the form "How can I do X on a pandas df".

P.S.

How many times am I going to have to use the word "average" in this post?

I used the word average 29 times lol.

What would I do differently?

There are some things that I wish I had tracked from the beginning:

  • Recording a one or two sentence description of each book, and/or how it made me feel.
  • Keeping track of how I found or was recommended each book (to attribute recommendation quality back to the source).
  • Keeping track of how long a book was on my to-read list before I bought it - I would like to know if there's any correlation between "impulse buying" books and my eventual ratings.
  • Tagging books based on their topics (in addition to labelling the main genre).

I would also have reduced the scope of this blog post; it took me too damn long to write!

What Do Now?

  • I have no plans to ever stop tracking my reading. It has become second nature to me.

  • I intend to read fewer books in 2024. My aim in doing this is to spend less of my time on passive consumption (i.e. reading) and more of it on active learning (e.g. writing or technical projects).

  • Not chasing a specific number of books per year (i.e. 52) will also free me up to read more non-book content - for example Paul Graham's Essays and CS/ML blogs & papers.