How Big Data discovered that Othello is a comedy (and other adventures in digital Shakespeare)

“[W]hat’s past is prologue,” wrote William Shakespeare, “what to come, in yours and my discharge.” These lines from The Tempest have now very unexpectedly come to define the fate of the internet user in times of Big Data, not to speak of Shakespeare himself, in the year of his 400th death anniversary.

Data is not new; it was always there. Even Sherlock Holmes cried out for data in moments when he found himself groping after amorphous facts: “Data! Data! Data! I can’t make bricks without clay!” But it is only in the last six years or so that data has started to be analysed through mathematical models in a very big way – from the private sector to state policy, from government UIDs to health care schemes, from music apps to Shakespearenomics.

Liberal arts in the age of Big Data or Big Data in liberal arts?

In February 2016, Vinod Khosla, Indian-American billionaire businessman and founder of Khosla Ventures, wrote a notorious and widely circulated white paper, “Is majoring in liberal arts a mistake for students?” Contrary to the beliefs of a sensible world, he argued, “[t]hough Jane Austen and Shakespeare might be important, they are far less important than many other things that are more relevant to make an intelligent, continuously learning citizen.”

Besides his other arguments, Khosla held that coding and data were more important than Shakespeare. It is not uncommon to come across this and much worse. In fact, criticism of Shakespeare should be more welcome in the age of global connectivity – or in any age, for that matter – than blind populism. In an episode of the BBC series dedicated to the bard, “Shakespeare’s Restless World,” the presenter Neil MacGregor notes:

When Shakespeare turned into a book, the man who built the Globe became a global figure. And that means global in the most up-to-date sense: Johnstoune's copy of the First Folio is now in Meisei University in Tokyo. But I am studying it in a cafe in London on my smartphone. In A Midsummer Night's Dream, Puck puts a girdle round the earth in forty minutes. In the world of modern magic, online Shakespeare circles the globe instantly.

But detractors of the liberal arts often take the importance of digital connectivity, Big Data and cloud computing too far, placing them at loggerheads with Shakespeare especially. Not lagging behind are the traditional liberal arts academics, who turn queasy at any attempt to reconcile the literary discipline with data mining (the same techniques that are used in business analysis or microbial forensics).

Stanford University’s English Professor, Franco Moretti, had already started making inroads into the traditional ways of reading literature, through his literary lab, where he began “bringing a science-fiction thrill to the science of fiction.” The idea was that scholars would not have to read volumes of literature in order to acquire knowledge or even derive pleasure – data mining would suffice.

There are innumerable puns in Shakespeare that are now lost owing to the transition from Original Pronunciation (OP) to Received Pronunciation, such as “about/a boat,” “loins/lines,” “groin/groan,” “whose/house,” “it/yet,” “woman/woe-man,” “to be/bay or not to be/bay,” “no/now is the winter of our discontent,” and so on. It is possible to hear traces of OP even today in Canada, Australia, and New Zealand, not just in Scotland or Wales. What if we fed in all that data and let our intuitive software churn out a list of puns in Shakespeare that would otherwise have been lost even to native “naked” ears (or airs?)?

Linguists David and Ben Crystal have been serving as human data miners for many years now, revealing to Shakespeare lovers a wealth of the bard’s intended or even unintended puns. Their work goes to show how even the scholarly or artistic mind works almost like a mining engineer. BBC documentaries such as Shakespeare’s Restless World (2012) or Shakespeare’s London (2000), in which the commodities or cityscapes from the bard’s work are scanned, represented and expatiated on, are after all examples of data mining, albeit in an indirect and sophisticated way.

In October 2011, Michael Witmore, the director of the Folger Shakespeare Library, described the results of the unusual experiments he had long been performing, examining the plays of Shakespeare from the First Folio with rhetorical analysis tools. The preliminary inference suggested that the language of the tragedies differed greatly from that of the comedies.

Further observations support the notion that some of the tragedies, such as Othello, might originally have been written with comedic cues – which, at least linguistically, renders the play a comedy. Othello was found to be exceptionally rich in Shakespeare’s comedic vocabulary, containing several recycled elements from Twelfth Night, which is traditionally a comedy. Since 2004, Witmore has been publishing his digital research into Shakespeare, beginning with the academic journal Early Modern Literary Studies, and later on his blog.

Furthermore, data mining is now capable of clarifying frequently raised doubts about the authorship of the plays attributed to the bard. Suspicions have been voiced from time to time as to whether some of the plays were really written by Shakespeare, and not by Francis Bacon, Christopher Marlowe or the Earl of Oxford.

Recently, scientists at the University of California, Berkeley studied the works of Shakespeare using feature frequency profiling techniques, which are also used to study family trees of nucleotide base and amino acid sequences. They found that, with the exception of Pericles, most of the other plays of Shakespeare that could allegedly have been compositions of Marlowe or Bacon matched the “Bard's cluster.” In other words, the equivocation on the question of authorship was eliminated.
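
For readers curious about what a feature frequency profile looks like in practice, here is a minimal sketch in Python. The article does not give the details of the Berkeley method, so everything specific here is an assumption made for illustration: the features are runs of three consecutive letters, the texts are compared with the Jensen-Shannon divergence, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def feature_frequency_profile(text, n=3):
    """Relative frequency of every run of n consecutive letters in text.

    Assumption: spaces, digits and punctuation are dropped and case is
    ignored, so the text is treated as one long string of letters.
    """
    letters = re.sub(r"[^a-z]", "", text.lower())
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text is too short to contain any features")
    return {gram: count / total for gram, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """Distance between two profiles: 0 for identical, 1 for no overlap."""
    divergence = 0.0
    for gram in set(p) | set(q):
        p_i = p.get(gram, 0.0)
        q_i = q.get(gram, 0.0)
        mean = (p_i + q_i) / 2
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / mean)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / mean)
    return divergence


# Toy comparison of two short quotations; real studies use whole plays.
first = feature_frequency_profile("What's past is prologue")
second = feature_frequency_profile("If music be the food of love, play on")
print(round(jensen_shannon_divergence(first, second), 3))
```

In a sketch like this, texts whose profiles sit close together would be grouped into the same cluster, which is the general idea behind a play matching or not matching a given author.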

Shakespeare’s digital signatures

Image courtesy: Nicholas Rougeux

It is not easy to equate the liberal arts with the science of digitisation or data analysis. It is harder still for the latter to claim superiority. On his 400th death anniversary, Shakespeare is offering stiff resistance to any attempt by the data sciences to make headway without him. Without the liberal arts, as without the day-to-day things of human existence and welfare, data analysis runs the risk of becoming art for art’s sake. In 2014 came the first Shakespearean sonnet written collaboratively by a human artist and artificial intelligence. The collaborators were the artist J Nathan Matias – now a PhD student at MIT Media Lab – and SwiftKey, a machine-learning-powered word prediction engine.

That even digital poetic artifice was being driven by Shakespeare was not enough. More recently, in 2016, data artist Nicholas Rougeux launched a series of Shakespeare’s digital signatures, one for each of his 154 sonnets. He took the sum of the letters on each line (counting each letter by its rank in the English alphabet) and divided it by the number of letters in that line. Then he plotted the value of each line of a sonnet on a digital graph, until a signature emerged once all fourteen lines were plotted (see image above).
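
The arithmetic behind each line's value can be sketched in a few lines of Python. This is only an illustration of the averaging step as described here, not Rougeux's own code: the function names are hypothetical, it is an assumption that spaces and punctuation are ignored, and how the fourteen values are then drawn into a signature is left out.

```python
import string


def line_signature_value(line):
    """Average alphabet rank (a=1 ... z=26) of the letters in one line.

    Assumption: spaces, punctuation and digits are skipped and case is
    ignored, so only the 26 English letters count.
    """
    ranks = [
        ord(ch) - ord("a") + 1
        for ch in line.lower()
        if ch in string.ascii_lowercase
    ]
    if not ranks:
        raise ValueError("line contains no letters")
    return sum(ranks) / len(ranks)


def sonnet_signature(lines):
    """One value per line; a full sonnet yields fourteen values."""
    return [line_signature_value(line) for line in lines]


# The opening two lines of Sonnet 18, used here only as sample input.
opening = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
]
for line, value in zip(opening, sonnet_signature(opening)):
    print(f"{value:5.2f}  {line}")
```

Because every line averages differently, the sequence of fourteen values differs from sonnet to sonnet, which is what gives each plotted signature its own shape.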

Not only did the signature artworks show how unique each Shakespearean sonnet was, they also demonstrated that for a rich and meaningful application of data science, one must turn to Shakespeare every once in a while.

The bard has provisionally come to the rescue of the liberal arts, in the last decade or so. It is very much incumbent upon traditionalists to open up to the new possibilities of data mining. Art or artistry is no longer confined to what is received or stored in the repositories of tradition and nostalgia, but has already extended to domains of reading and interpretation. Data mining has opened up a whole new world of artistic expression in the act of how we interpret what we choose to interpret. It is for traditional liberalists to decide how liberal they really are and whether they intend to acknowledge big data tools in more universal ways.

The task of the literary scholar will be in no way mitigated, but only made more challenging, henceforward. There is bound to be exhaustion and even data mining tools might start complaining of ennui owing to being fed the same data sets or questions repeatedly.

We need many new Shakespeares, many more Londons to be discovered, and many more linguistic worlds to go with them if data and art are to keep each other alive. Unless poetical, philosophical and artistic examples are thrown up to challenge the new science of data mining, it might lead to the degeneration of both art and data. So, while art be the food of data, play on.

Arup K Chatterjee is the founding chief editor of Coldnoon: Travel Poetics – International Journal of Travel Writing.
