More broadly, natural language processing transforms language into constructs that can be usefully manipulated. Because deep learning embeddings have proven to be so powerful, they've also become the default: pick a model, embed your data, pick a metric, do some RAG. To add new value, it is helpful to have a different view of complex language.
The one I will share today began years ago, with a single book.
The Orchid Thief is both non-fiction and full of mischief. I first read it when I was 20, skipping most of the historical anecdotes, eager for its first-person accounts. At the time I laughed out loud, but I also turned the pages in quiet fury that someone could live so deeply and write so well. I wasn't so sure those were different things.
After a year I moved to London to start again.
I went into financial services, which is like a theme park for nerds. And, for the next decade, I would only take jobs with a lot of writing.
A lot is the key word.
Behind the modern facade of professional services, British industry lives on in its old factories and shipyards. Alice is employed to do one thing and then hands her piece to Bob; he turns a few screws and passes it to Charlie. A month later, we're all doing it again. As a newcomer, I realized that these habits were not so much a rut to fall into as a mound to stand on.
I was also reading a lot. Okay, I was reading the New Yorker. My favorite thing was to take a new issue, open it from the back, and read the opening sentences of one Anthony Lane, who writes the movie reviews. Year after year, I never once went to see one of the movies.
Every once in a while, a flicker would catch me off guard: a barely visible thread between the New Yorker corpus and my own non-Pulitzer output. In both corpora, each piece was different from its sisters, but also…not quite. The similarities echoed. And I knew that those in my own work had emerged from a repetitive process.
In 2017 I began to meditate on the threshold that separates writing that merely feels formulaic from writing that can be expressed explicitly as a formula.
The argument is this: a high volume of repetition hints at a (typically tacit) form of algorithmic decision-making. And procedural repetition leaves traces. Trace the fingerprints to expose the procedure; decode the algorithm; and the software practically writes itself.
By my last job, I wasn't writing much anymore. My software was.
In principle, companies could learn enough about their own workflows to make huge profits, but few bother. People seem far more enthralled with what everyone else is doing.
For example, my bosses, and later my clients, kept wishing their staff could emulate The Economist's house style. But how would you find out which steps The Economist takes to end up sounding the way it does?
Enter text analysis
Read a single Economist article and you feel happy and safe. Read a lot of them and they sound pretty similar. A full print magazine comes out every week. Yes, I was betting there was a process.
For fun, let's apply a readability function (measured in years of education) to several hundred Economist articles. Let's also do the same with hundreds of articles published by a frustrated European asset manager.
Next, let's plot a histogram to see how those readability scores are distributed.
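Under the hood, those two steps can be sketched in a few lines of Python. This is a minimal sketch, not the tooling I actually used: the Flesch-Kincaid grade formula is standard, but the syllable counter here is a crude heuristic, and a real analysis would lean on a library such as textstat.

```python
import re
from collections import Counter

def fk_grade(text):
    # Flesch-Kincaid grade level: roughly the years of education
    # a reader needs to follow the text comfortably.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Crude syllable heuristic: count runs of vowels in each word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

def grade_histogram(corpus, top=24):
    # Map the readability function over a corpus, then bucket the
    # scores into integer grade levels, ready for plotting.
    scores = (min(max(int(fk_grade(t)), 0), top) for t in corpus)
    counts = Counter(scores)
    return {grade: counts.get(grade, 0) for grade in range(top + 1)}
```

Run `grade_histogram` once per corpus and overlay the two results; the gap between the curves is the chart described below.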
Just two functions and look at the information we get!
Notice how far apart the curves are; this asset manager is not sounding like The Economist. We could dig deeper to see what is driving the disparity. (For starters, it is often wildly long sentences.)
But also notice how The Economist caps the readability scores it allows. The curve is inorganic, which reveals strict readability control in their editing process.
Finally, and many of my clients struggled with this, The Economist commits to writing clearly enough that the average high-school student can digest it.
I had been expecting these graphs; I had scribbled them on paper. But when a real one lit up my screen for the first time, it was as if language itself had been laid bare.
Now, I wasn't exactly the first on the scene. In 1964, statisticians Frederick Mosteller and David Wallace appeared on the cover of Time magazine, their forensic literary analysis resolving a 140-year-old debate about the authorship of a dozen famous essays written anonymously.
But forensic analysis always examines a single item against two corpora: one produced by the suspected author, and one for the null hypothesis. Comparative analysis, by contrast, is concerned only with comparing bodies of text against each other.
Creating a text analysis engine
Let's retrace our steps: given a corpus, we applied the same function (the readability function) to each of its texts. This mapped the corpus to a set (in this case, of numbers). To this set we applied another function (the histogram). Finally, we did all of this with two different corpora and compared the results.
If you squint, you'll see that I just described Excel.
What looks like a table is actually a pipeline, building columns sequentially: first a function applied along the column, followed by functions on the results, followed by comparative analysis functions.
Well, I wanted Excel, but for text.
Not strings: text. I wanted to apply functions like Count Verbs, or First Paragraph Subject, or First Important Sentence. And it had to be flexible enough to ask any question; who knows what would end up mattering?
In 2020 this kind of solution didn't exist, so I built it. And boy, does this software not “practically write itself”! Making it possible to ask any question required some good architectural decisions, which I got wrong twice before getting them right.
In the end, functions are defined once, each against a single input text. You then select the steps of the process and the corpora they act on.
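A minimal sketch of that design, with hypothetical step names (my real steps, like Count Verbs, would need a part-of-speech tagger behind them): each function is written once against a single text, and the pipeline handles mapping the selected steps over whole corpora.

```python
import re
import statistics

# Each step is defined once, against a single input text.
def word_count(text):
    return len(re.findall(r"[A-Za-z']+", text))

def avg_sentence_length(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return word_count(text) / len(sentences)

def run_pipeline(corpora, steps):
    # Apply every selected step to every text in every corpus,
    # then summarise each resulting column with its mean.
    return {
        corpus_name: {
            step.__name__: statistics.mean(step(t) for t in texts)
            for step in steps
        }
        for corpus_name, texts in corpora.items()
    }
```

Called as `run_pipeline({"economist": [...], "asset_manager": [...]}, [word_count, avg_sentence_length])`, it returns one summary row per corpus, ready for side-by-side comparison.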
With that, I started a writing-technology consulting company, FinText. I planned to build while working with clients and see what stuck.
What the market said
The first business use case that occurred to me was social listening. Market research and surveys are big business, and this was the height of the pandemic: everyone was at home. I thought that processing live conversations in dedicated online communities could be a new way to access customer thinking.
Any first software client would have felt special, but this one was exciting, because my concoction actually helped real people get out of a jam:
Working ahead of a big event, they had planned to release a flagship report using data from a paid YouGov survey. But its results came back lukewarm. So, with the remaining budget, they commissioned a FinText study. It was our findings that ended up front and center in their final report.
But social listening didn't take off. Investment land is peculiar: pools of money will always need a home; the only question is who houses them. Most industry people I spoke to wanted to know what their competitors were doing.
So the second use case, competitive content analysis, received a warmer response. I sold this solution to about half a dozen companies (including, for example, Aviva Investors).
All the while, our engine was collecting data that no one else had. Such was my cleverness that holding training sessions wasn't even my idea; a client asked me for one first. That's how I learned that companies like to buy training.
Otherwise, my steampunk version of writing was proving a tough sell. It was all too abstract. What I needed was a dashboard: pretty charts, with real numbers, pulled from live data. A pipeline did the math, and I hired a small team to make the pretty graphs.
Within the dashboard, two charts showed a breakdown of topics and the rest looked at writing style. I will say a few words about this choice.
Everyone believes that what they have to say is important. If others don't care, then surely it is their moral failure, weighing style over substance. A bit like bad taste, it's something only other people have.
Scientists have counted clicks, tracked eyes, monitored scrolling, and timed attention. We know that it takes readers a split second to decide whether something is “for them,” and they decide by loosely comparing new information to what they already like. Style is the entry pass.
What the dashboard showed
Until then, I hadn't been tracking the data we were collecting, but now I had all those nice graphs. And they were proving that I had been right and, at the same time, very, very wrong.
At first, I had direct knowledge of only a few large investment firms, and I suspected that their competitors' workflows were very similar. This turned out to be correct.
But I had also assumed that slightly smaller companies would simply produce slightly less. This was simply not true.
Text analysis was useful if a company already had the capacity to produce writing. Otherwise, what it needed was a functioning factory. There were very few companies in the first group, because everyone else was crowded into the second.
Epilogue
As a product, text analytics has been a mixed bag. It made some money, it probably could have made a little more, but it was unlikely to become a runaway success.
What's more, I had lost my appetite for the New Yorker. At some point everything tilted too far towards the formulaic, and the magic disappeared.
Words are now in their wholesale era, with large language models like ChatGPT. At first I considered applying my pipelines to discern whether a text is machine-generated, but what would be the point?
Instead, in late 2023 I started working on a solution that helps companies scale their ability to write for expert clients. It's a completely different adventure, still in its infancy.
Eventually, I came to think of text analysis as an extra pair of glasses. Sometimes the blur becomes clear. I keep it in my pocket, just in case.