Amazon’s Text Stats and a Little Orwell
Watching changes on Amazon.com is a good way to see how far one’s information can be stretched. The continual refinement of suggested books and other items is a little disturbing, but it often yields titles that I find useful. The Gold Box, with its game show approach to sales, is an example of this information mining. To use the Gold Box, one clicks on the box and is offered an item that usually relates to something one has purchased before, or at least looked at. When the item is on screen, one must choose between accepting the offer or passing on it to see the next offer, with no chance to go back to the previous one. All decisions must be made within one hour of opening the box. I have opened the box a few times and am often surprised by some of the items that show up in there. Given how often Amazon seems to correlate interests, when what seems to be an aberration pops up, I wonder whether it is a random shot to see if it will stick or whether in some deep way Amazon has discerned that I have a hidden desire for vitamins, herbal remedies, or hairdryers. So when I saw that Amazon had added Text Stats, I had to poke around. After all, who knows what information would come my way by seeing the statistics (whatever they may be) on a book?
I found that not all books have this information, but it seems that when publishers play along Amazon will give a book’s statistics, including syllables per word; words per sentence; total number of characters, words, and sentences; and my favorites, the “Fun Stats,” words per dollar and words per ounce. Amazon takes this information and gives scores for Readability (explained below). Apparently the Bible, depending on the edition, requires either a twelfth grade or a tenth grade reading level. Yet one study of government Web sites states that “half of Americans read at no higher than the 8th grade level.” You may draw your own conclusions.
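For the curious, here is a minimal sketch of how statistics like these could be computed from a passage of text. It is not Amazon’s method, which I have not seen. The function names, the vowel-run syllable estimate, and the sample price and weight are all my own assumptions for illustration.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels (an assumption, not Amazon's rule)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text: str, price_dollars: float, weight_ounces: float) -> dict:
    """Compute basic counts plus the two 'Fun Stats' for a passage of text."""
    # Treat ., ! and ? as sentence ends; a crude rule that mishandles abbreviations.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "syllables_per_word": syllables / len(words),
        "words_per_sentence": len(words) / len(sentences),
        "words_per_dollar": len(words) / price_dollars,
        "words_per_ounce": len(words) / weight_ounces,
    }


# Hypothetical sample: two short sentences, a $2.00 price, a 4-ounce weight.
sample = "Never use a long word where a short one will do. Cut every word you can."
stats = text_stats(sample, price_dollars=2.0, weight_ounces=4.0)
print(stats)
```

The vowel-run estimate overcounts words with a silent final “e,” so the syllables per word figure from this sketch will run a little high compared with a dictionary-based count.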
Text Stats also gives information about where a book stands in relation to all other books (and in some cases one can compare within classes of texts). So I started to poke around, and it seems that perhaps the best writing correlates with simpler writing, if we take the numbers seriously (and there is reason not to once one examines exactly what readability means). That reminded me of Orwell’s “Politics and the English Language,” but I’ll get to that later. To play with the idea, I looked at the Modern Library’s list of 100 best novels to see how they compared to all texts in the Amazon set and then within literature.
The Text Stats for Ulysses show that a ninth grade reading level is required under the Fog test and that 80 percent of all books are more difficult. Furthermore, 10 percent of the words are complex, yet 71 percent of all books have a higher share of complex words. At 1.5 syllables per word, 73 percent of books have more syllables per word; and with only 12.1 words per sentence, 76 percent of literature books have more words per sentence. Oh yes, one obtains 16,776 words per dollar and 9,516 words per ounce for the Modern Library edition.
How do other books do? Only 16 percent of texts are more difficult than Heidegger’s Being and Time (not on the list). Looking at the rest of the Modern Library’s list of 100 best novels, here are the numbers for the top ten, given as the percentage of all texts that are more difficult: Ulysses, 80 percent; The Great Gatsby, 79 percent; A Portrait of the Artist as a Young Man, 62 percent; Lolita, 49 percent; Brave New World, 77 percent; The Sound and the Fury, 96 percent; Catch-22, 75 percent; Darkness at Noon, (unavailable); Sons and Lovers, 93 percent; The Grapes of Wrath, 90 percent.
If we take the same list and compare it to literature authors A to Z (Amazon’s classification):
Ulysses, 69 percent; The Great Gatsby, 65 percent; A Portrait of the Artist as a Young Man, 33 percent; Lolita, 21 percent; Brave New World, N/A; The Sound and the Fury, 93 percent; Catch-22, 56 percent; Darkness at Noon, (unavailable); Sons and Lovers, 96 percent; The Grapes of Wrath, 92 percent.
What does all this information mean? Again, draw your own conclusions. It may be a matter of style. Although The Sound and the Fury is technically simple, I don’t think one would suggest that a less than sixth grade reading level is all that is needed to read and understand it. Still, I wonder if someone could cull the data and find that better writing correlates with simpler words and sentences. That idea reminds me of Orwell’s “Politics and the English Language,” where he is clear that he does not prescribe rote rules of language or the “setting up of a ‘standard English’” but still offers some rules to guide writers:
(i) Never use a metaphor, simile, or other figure of speech which you are used to seeing in print.
(ii) Never use a long word where a short one will do.
(iii) If it is possible to cut a word out, always cut it out.
(iv) Never use the passive where you can use the active.
(v) Never use a foreign phrase, a scientific word, or a jargon word if you can think of an everyday English equivalent.
(vi) Break any of these rules sooner than say anything outright barbarous.
Sounds like marching orders. Marching orders from Orwell. Hmmm.
For those curious about readability etc., Amazon gives three scores based on the Fog Index, the Flesch Index, and the Flesch-Kincaid Index. The site had explanations of what readability meant, but the link has been dead for a few weeks now. It seems that the Fog Index is the Gunning Fog Index, which uses this formula: ((words/sentences) + 100 * (complex words/words)) * 0.4, where words/sentences is the average number of words per sentence and complex words are words with three or more syllables. The score runs from the single digits on up and equals the grade reading level. So a score of eight means an eighth grade reading level is required; a score of 12, a twelfth grade level, and so on. The Flesch and Flesch-Kincaid formulas purport to show readability as well.
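To see how the Fog formula plays out, here is a small sketch that plugs in the Ulysses figures quoted earlier (12.1 words per sentence, 10 percent complex words, 1.5 syllables per word). The Flesch Reading Ease and Flesch-Kincaid Grade Level functions use the constants commonly published for those formulas; I am assuming, not asserting, that Amazon uses the same ones. The function names are mine.

```python
def gunning_fog(words_per_sentence: float, complex_word_fraction: float) -> float:
    """Gunning Fog: 0.4 * (words per sentence + 100 * share of complex words)."""
    return 0.4 * (words_per_sentence + 100 * complex_word_fraction)


def flesch_reading_ease(words_per_sentence: float, syllables_per_word: float) -> float:
    """Flesch Reading Ease (commonly published constants); higher means easier."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(words_per_sentence: float, syllables_per_word: float) -> float:
    """Flesch-Kincaid Grade Level (commonly published constants)."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Rounded Ulysses figures as quoted in this post.
fog = gunning_fog(12.1, 0.10)
ease = flesch_reading_ease(12.1, 1.5)
grade = flesch_kincaid_grade(12.1, 1.5)

print(f"Fog: {fog:.1f}")
print(f"Flesch Reading Ease: {ease:.1f}")
print(f"Flesch-Kincaid Grade: {grade:.1f}")
```

With these rounded inputs the Fog score comes out a little under nine, which squares with the ninth grade level Amazon reports for Ulysses. The two Flesch numbers are only as good as the rounded inputs and the assumed constants, so treat them as ballpark figures.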