Idea #13: Metrics on Text

For the most part, you can only tell how good a book is after you have read some of it. Sometimes you identify with how closely the author’s ideas resemble your own, and other times it’s the innovative writing style that hooks you. All of this, regarded here as the ‘quality’ of the text, is too personal and subjective to be put into numbers.

However, there are some aspects of a text that can be quantified. Some of these attributes are:

  1. how positive a text is, meaning that its words are associated with good memories and emotions;
  2. how inappropriate a text is, meaning that it contains socially frowned-upon words;
  3. how erudite a text is, with elaborate words not commonly used;
  4. how technical a text is, revealed by the presence of technical terms.

But why have this information before reading it?

Well, first, to help decide whether or not to read a text (or whether or not to buy a book). If you have a measure of how negative a text is, and you are in the mood for some light reading, you would probably leave it for another day. Second, to establish some sort of relative distance between texts: a scale of how technical they are, for example.

This is a proposed method of measuring such qualities of a text or book. It is, of course, not exact math, as will be seen. It’s simple but demands some work.

Let’s tackle quality number 1: how positive or negative a text is. For the average person, some words are associated with positive or negative emotions or memories. For example, heaven is a ‘positive’ word while rape is a ‘negative’ word. Some words, such as banana, don’t evoke a strong response and are ‘neutral’:

positive   neutral  negative
heaven     banana   rape
waterfall  shower   prison
baby       rock     abortion

Example of a classification

There is a more comprehensive, but somewhat dubious, list on this site: positive/negative. You will notice that I limited myself to nouns because they are easier to work with. Many words lie in a gray area that is best treated as neutral. To keep it simple: when in doubt, it’s neutral.

So the first step is to associate each word with +1, 0 or -1. It’s not necessary to classify every word (which would be impossible), just a large enough group of them.

Then the second step is to run through the text, adding up the value of each classified word every time it appears. For example, the phrase ‘The baby took a shower’ would yield Q=1 (baby is +1, shower is 0), while ‘The woman had an abortion in prison.’ yields Q=-2 (abortion and prison are -1 each). You can see how it works: the second sentence is obviously a lot heavier than the first.
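To make the counting step concrete, here is a minimal sketch in Python. The table only holds the nine example words from the classification above, and the names WORD_SCORES and text_score are made up for this illustration. Splitting the text on anything that is not a letter or an apostrophe is also just an assumption about what counts as a word.

```python
import re

# Hypothetical word table taken from the example classification:
# +1 for positive, 0 for neutral, -1 for negative.
WORD_SCORES = {
    "heaven": 1, "waterfall": 1, "baby": 1,
    "banana": 0, "shower": 0, "rock": 0,
    "rape": -1, "prison": -1, "abortion": -1,
}

def text_score(text, table=WORD_SCORES):
    """Return Q: the sum of the table values of every word in the text.

    Words missing from the table are treated as neutral (0).
    """
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return sum(table.get(word, 0) for word in words)

print(text_score("The baby took a shower"))                # 1
print(text_score("The woman had an abortion in prison."))  # -2
```

A word that appears twice is counted twice, so longer texts tend to drift further from zero. Dividing Q by the number of words would be one way to compare texts of different lengths, but that is a variation, not part of the method described here.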

This is meant to be done by a computer.

We can fine-tune this: to get the table of values for each word, an online survey could be set up that presents the user with an isolated word and asks them to rate it. Different users would rate the same word. The results would lead to a scale from the most negative to the most positive words, with statistical meaning.
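One possible way to turn such survey answers into a table is to average the ratings each word receives. The sketch below assumes every answer arrives as a (word, rating) pair with the rating being -1, 0 or +1. The function name build_table and the sample answers are invented for the illustration and are not real survey data.

```python
from collections import defaultdict
from statistics import mean

def build_table(responses):
    """Average the ratings given to each word.

    responses: iterable of (word, rating) pairs, one per survey answer.
    Returns a dict mapping each word to its mean rating.
    """
    ratings = defaultdict(list)
    for word, rating in responses:
        ratings[word.lower()].append(rating)
    return {word: mean(values) for word, values in ratings.items()}

# Made-up answers from three imaginary users, for illustration only.
sample = [
    ("heaven", 1), ("heaven", 1), ("heaven", 0),
    ("prison", -1), ("prison", -1),
    ("banana", 0),
]
table = build_table(sample)
print(table)
```

With averaged ratings the table values are no longer limited to +1, 0 and -1, so a word that most people find only mildly positive ends up somewhere between 0 and 1.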

The text is fed into the computer, which compares each word with the table, and there you have the quality of a text. This requires the text to be available in electronic form at some point.

Books could have a set of these measurements printed on the back cover.

[IMAGE: Beautiful library of Trinity College, Dublin]