Automatic clause databases: the new fairy-tale?

There's an interesting new type of legal technology on the market: automatic clause databases. They cater to that nagging feeling of having written a clause many months ago, that would be perfect for re-insertion into a contract today, if only you were able to find it easily…

Automatic clause database products (such as Draftwise, Henchman and Syntheia) crawl through a stack of existing documents, and automatically extract relevant clauses from them. The simplicity of these products and the zero setup-time seem exactly what busy legal professionals need. What’s not to like?

Update July 2023: after publishing this blog post, we kept receiving requests from customers who wanted to use the original module described below. In the end, instead of stubbornly preaching why it's really a bad idea, we decided to make this module available to everyone. If, after reading this page, you still want it, then get in touch to try it out yourself! You'll probably be amazed at first by how little effort it requires, but be warned that this wears off, and you'll come to appreciate why quality clause libraries are so much better!

Killing our own product

Our own product, ClauseBuddy, takes an entirely different approach. It allows you to build a high-quality library of carefully curated clauses. While offering various features to speed up the clause uploading process, it strongly encourages you to manually curate clauses, to avoid the garbage-in / garbage-out problem.

Lured by the appealing sales pitch of automatic clause databases, some of our customers suggested that we integrate automatic clause extraction into ClauseBuddy. In summer 2022, we therefore started developing this new feature, which we called the “Haystack”.

Fast forward 6 months, to December 2022. As you can see in the teaser video we produced for our early users, our Haystack feature automatically ingested thousands of existing MS Word or PDF files, extracted relevant clauses from them, and allowed you to search for those clauses in a matter of seconds.

Then some loyal users started testing it… and suddenly we realised that all the theoretical fears we had were indeed true. We had created a product that was perfect in theory, yet poor for legal practice.

Read on to learn why this kind of software does not work, and why we ultimately killed a feature we spent months developing.

Splitting is hard

Suppose you were manually building a clause library, and you were confronted with the following clause:

Assuming you think it is wonderfully written and you want to include it in your clause library, how would you deal with it? Keep it as a whole and insert it as-is, because that’s the nature of such a “miscellaneous” clause? Or store each of the individual paragraphs? Or perhaps store both?

Second example. How would you deal with a clause entitled “Obligations of the distributor”, spanning four pages in total, containing five different subtitles? Keep it as a whole? Split it at the level of the subtitles? At the level of each of the individual obligations?

The answer to these questions is “it depends”. This requires a combination of legal experience and some judgement calls. Advanced artificial intelligence cannot make this assessment for you (yet).

Keywords are not key

As everyone knows, the trick to using Google-style keyword searches is a clever combination of good keywords.

A good keyword is ideally repeated a few times in a text, and is relatively unique. Problem is, unlike newspaper articles or books, contract clauses are written using a deliberately limited vocabulary of words (such as “confidentiality”, “obligation”, “agreement”, “liable”, etc.) which are then repeated over and over again.

A very specific word or combination of words works well, e.g. “garden leave” in employment contracts or “escrow” in software contracts. Conversely, searching for a broad keyword such as “confidentiality” is a bad idea, because it is used in very diverse clauses. Such keywords have low “information density”, so searching for them leads to noisy results.
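
The notion of “information density” can be made concrete with inverse document frequency (IDF), a standard measure from information retrieval. Below is a minimal Python sketch of that idea. The four sample clauses and the function names are made up for illustration and are not taken from any of the products mentioned in this post.

```python
import math
import re

def tokenize(text):
    """Lowercase a clause and split it into plain words."""
    return re.findall(r"[a-z]+", text.lower())

def inverse_document_frequency(term, clauses):
    """Return log(N / n), with N clauses in total and n clauses containing the term.

    A term that occurs in every clause scores 0.0: it cannot tell clauses apart.
    """
    containing = sum(1 for clause in clauses if term in tokenize(clause))
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any clause")
    return math.log(len(clauses) / containing)

# Hypothetical mini-corpus, invented for this example.
SAMPLE_CLAUSES = [
    "The Receiving Party shall comply with its confidentiality obligations under this Agreement.",
    "The confidentiality obligations survive termination of this Agreement.",
    "Source code placed in escrow is subject to the confidentiality obligations of this Agreement.",
    "Neither party is liable for a breach of confidentiality caused by force majeure.",
]

broad = inverse_document_frequency("confidentiality", SAMPLE_CLAUSES)
specific = inverse_document_frequency("escrow", SAMPLE_CLAUSES)
print(f"confidentiality: {broad:.2f}, escrow: {specific:.2f}")
```

In this toy corpus “confidentiality” occurs in every clause and therefore scores zero, while “escrow” occurs only once and scores much higher. Real search engines combine such scores with many other signals, but the basic effect is the same.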

Perhaps paradoxically, the best keywords are those that are not explicitly written down in a clause. For example, in corporate law, a “Texas shootout clause” will never literally contain the words “Texas” or “shootout”, even though such words would be extremely good keywords.

Why then does Google provide such excellent results with a simple keyword search? The secret lies in user feedback. Even when you use keywords with low information-density — e.g. a search for “weather today”— Google can provide excellent search results, because billions of people have previously submitted this search phrase. Users have clicked on one of the resulting pages, and then stayed on the resulting page for a minute, or instead immediately tried a next webpage. Based on this user behaviour, Google can learn which webpages were good or bad for certain search phrases.

Unfortunately, this will never work for private clause databases. Even large law firms will not even reach 0.0001% of the volume of Google’s user feedback, so clause search tools will never get even close to Google’s quality when dealing with low information-density contract clauses.

Result: with a typical keyword search, you will end up with hundreds of clauses of a very diverse nature. What you really want is to be able to automatically filter clauses on their context (e.g., a clause in a Share Purchase Agreement) or category (e.g., a confidentiality clause). This is exactly what a manually curated clause library will bring you, but can software add that context automatically?

Clause titles to the rescue!

Titles are a great source of condensed information when searching for legal doctrine, so perhaps this also works for clause titles?

Yes, but only a little bit:

Clause titles tend to be short and vague, because they often cover diverse topics.
They are often shared between heterogeneous contracts: e.g., the “whereas” clauses of a lease and of a sales agreement are entirely different.
Clever drafters turn the titles into a legal game, deliberately hiding the most important clauses behind a meaningless title. Some drafters even deliberately "hide" contentious paragraphs, e.g. by putting a general liability limitation into some subsection of a confidentiality clause.
Many interesting clauses will not carry a title at all, sometimes due to laziness, sometimes due to clever drafting (to avoid drawing attention), sometimes because of the way the contract happens to be structured.

Poor metadata

Automatic clause databases will also use the limited other information they can find: a document’s “metadata”. As you probably know, each MS Word and PDF file also stores interesting information about the file — such as the filename, author, date of creation, department name, and so on. The idea is to use that “metadata” to be able to filter clauses.

However, that metadata is even more fragile than the clause titles. When you inspect a random Word document’s metadata, you will notice that very often, the “author” happens to be the person who wrote the initial version of the document many years ago. Similar story with the filename. Sometimes it is a good indicator, sometimes not so much (“Acme – XYZ deal”), sometimes really bad (“Lease Deal final 20234443 final.rev HNH.version3.final. 4.FILEID8907789.docx.docx”). Even the date is unreliable, because even the slightest change in the document (e.g., recalculating a TOC) will cause MS Word to store a new date in the metadata.

Automatic clause databases may also try to extract the title of an agreement (usually found somewhere on the first page), because it can be a reasonable way to put at least some clauses into different buckets, e.g. to allow an end-user to distinguish a “Payments” clause in a distribution agreement from one in an employment agreement. Similarly, if an automatic clause database gets access to the name of the client or the name of the project (e.g., because the document was extracted directly from a content management system such as iManage), the database may give you a bit of useful information. Still, even then, such information is only relevant for legal experts who were themselves involved in the project for that client.

No useful categories

If keyword searches are difficult and clause titles and metadata are so fragile, you may wonder which other tricks automatic clause databases have up their sleeves. Perhaps automatic clause categorisation?

The idea is to automatically assign clauses to different categories, such as “assignment of intellectual property” or “applicable law”. A legal expert would then be able to limit the keyword search to one particular category.

Clause categorisation is somewhat similar to what contract reviewing software such as Kira and Luminance does: automatically finding problematic clauses (such as a change of control clause), for example to speed up a due diligence process.

The performance of contract reviewing software depends substantially on its training, i.e. providing it with thousands of labelled examples from which it can learn. For example, when fed a thousand examples of problematic warranty clauses, the software can learn to identify such clauses and recognise other examples in the future. Unfortunately, this is a costly process that is highly specific to each combination of language and legal domain. For example, in the Contract Understanding Atticus Dataset (CUAD) research project, clauses from about 500 commercial contracts were used for training, with an estimated cost of about 2 million USD if done by a commercial company.

Compared to contract reviewing software, automatic clause databases face a much more difficult categorisation problem, as they are dealing with an open-ended list of hundreds of possible clause categories, all depending on the legal subject matter, jurisdiction and language. Where the training process is already questionable for contract reviewing software, it will be completely unrealistic for automatic clause databases. And, of course, the “automatic” aspect would be completely forfeited…

But let’s assume that automatic clause categorisation somehow works, so that an end-user can easily limit searches to “assignment clauses” or “payment terms”. Does it really help the end-user to search in thousands of such example clauses per category?

No. What is really needed is a list of sub-categories within each main category. For example, for a duration clause, the sub-categories could be fixed duration, undefined duration, fixed initial term with automatic renewal, fixed initial term requiring explicit renewal, and so on.

Unfortunately, software cannot automatically assign hundreds of versions of the same clause to good sub-categories. In computer science, this is the problem of “clustering”, and even the most advanced clustering algorithms (which take into account the meaning of words, so that synonyms are treated similarly) have a disappointing performance.

For example, have a look at the clustering results produced by the Lingo3G product, for a set of 55 “Force Majeure” clauses. This software is highly specialised in clustering, e.g. to automatically categorise patent applications. Yet for clustering contract clauses, the results are almost useless — the main categories are a mix of topics, and within each main category there is a chaos of subtopics.

*You can interactively try this out yourself, by downloading the Excel file we used, and then uploading it as a “Local File” with the “Lingo3G” algorithm.*

These poor results should not be a surprise when you know that those algorithms have to juggle paragraphs containing very similar vocabulary. Even advanced new algorithms that take into account the “semantic” meaning of words (such as Word2Vec, GloVe, BERT and ELMo, as well as ChatGPT) struggle with the low information-density problem of typical contract clauses.
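
To get a feeling for the similar-vocabulary problem, consider bag-of-words cosine similarity, one of the simplest building blocks behind text clustering. The Python sketch below is only an illustration under our own assumptions: it is not how Lingo3G or any other product mentioned here works, and both sample clauses and all function names are invented.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase word occurs in a clause."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(clause_a, clause_b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    counts_a, counts_b = word_counts(clause_a), word_counts(clause_b)
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one clause contains no words
    return dot / (norm_a * norm_b)

# Two hypothetical duration clauses that belong in different sub-categories.
NO_RENEWAL = ("This agreement is entered into for a fixed term of two years "
              "and shall not be renewed automatically.")
AUTO_RENEWAL = ("This agreement is entered into for a fixed term of two years "
                "and shall be renewed automatically.")

print(f"similarity: {cosine_similarity(NO_RENEWAL, AUTO_RENEWAL):.2f}")
```

The two clauses differ by a single word, so this measure rates them as nearly identical, even though a lawyer would file them under opposite sub-categories. Semantic models are far more sophisticated than this sketch, but they start from the same kind of near-duplicate input.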

And this is not the end of the problem, because we have been focusing on what is explicitly written in clauses. Unfortunately, that’s only half the story.

What is not written?

Anglo-Saxon contracts tend to be longer and more explicit than European-style contracts, because the latter can omit a lot of information that is already provided by the applicable Civil Code or similar legislation. For example, a typical French or German contract will omit the obligations relating to the good-faith execution of the contract and simply rely on the fallbacks of the Civil Code. Indeed, what is not written in a contract can be as important as what is explicitly written.

Understanding the differences between those different situations requires years of experience and good knowledge of the law, legal doctrine and case law. Yet in the current state of the technology, software can only learn from what is explicitly written down.

Clause overload

Due to the unreliable titles and metadata, searches in automatic clause databases will typically lead to hundreds of results, except when a user can find a really good keyword combination. Going through those results requires a significant amount of time and mental effort.

You may think that having hundreds of clauses to choose from is better than having no clauses at hand. Perhaps so, but what we often witnessed with our test users were situations of analysis paralysis and the paradox of choice, particularly with junior team members. Faced with so many options on the screen, they simply fear making the wrong choice.

Context is king

Probably the biggest issue of all is the lack of context and guidance in automatic clause databases, because they present clauses in isolation, automatically extracted from the original document. Even trivial clauses — e.g. your run-of-the-mill confidentiality clause — tend to contain between ten and thirty different legal “features”, i.e. elements or nuances that may or may not be present in the clause (e.g., for a confidentiality clause, "duration of the obligations", "are the obligations mutual or unilateral?", etc).

While an experienced legal expert can easily enumerate about ten different features, some will not come to mind so easily. Furthermore, let’s not forget that the clauses are presented in isolation: even when some features are missing in the clause presented on the screen, they may have been present elsewhere in the original contract, e.g. in the definition list.

If an experienced legal expert drafted a certain document and has a good memory, then perhaps these problems will be non-issues. Or perhaps you have very tight internal quality procedures. But for all other users — particularly the junior team members — an automatic clause database will be full of legal landmines, waiting to explode due to all the missing context.

Clean-up required

The “zero setup” promise leads to another problem of automatic clause databases: the amount of clean-up required when a clause is actually inserted into a contract.

In a curated clause library you will be incentivised to remove client-specific and deal-specific facts (e.g., the contact details of the parties, pricing, dates, etc.). Without such cleaning, the users of the library will have to manually remove this information each and every time they insert the clause, which not only requires time, but also may cause quality issues if the user forgets to clean some elements.
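
To illustrate what such cleaning involves, here is a minimal Python sketch that replaces a few deal-specific facts with placeholders. The patterns, placeholder names, party names and sample clause are all assumptions invented for this example. Real clauses would need many more patterns, and a human check would still be required.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical patterns for this sketch only; each is paired with a placeholder.
PLACEHOLDER_PATTERNS = [
    (re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"), "[DATE]"),
    (re.compile(r"\b(?:EUR|USD|GBP) ?\d+(?:[.,]\d+)*"), "[AMOUNT]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
]

def strip_deal_facts(clause, party_names=()):
    """Replace dates, amounts, e-mail addresses and known party names."""
    for pattern, placeholder in PLACEHOLDER_PATTERNS:
        clause = pattern.sub(placeholder, clause)
    # Party names cannot be guessed by a pattern, so the caller supplies them.
    for name in party_names:
        clause = clause.replace(name, "[PARTY]")
    return clause

SAMPLE_CLAUSE = ("Acme NV shall pay EUR 25,000 to Beta Ltd by 1 March 2023; "
                 "invoices go to billing@acme.example.")

print(strip_deal_facts(SAMPLE_CLAUSE, party_names=["Acme NV", "Beta Ltd"]))
```

Note that the party names have to be supplied by hand, which is precisely the kind of knowledge that a curator has and an automatic extraction process lacks.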

Despite the zero setup-time required at the start, automatic clause databases will make you pay with your time, every single time you search or insert a clause.

Conclusion

Automatic clause databases have a few compelling use cases, but they will not bring you all the time savings you hope for, as they will require you to comb through piles of results with every search, undermining the zero setup-time you were promised.

You cannot blame legal drafters for wanting to believe the sales pitch for these products, because everyone is looking for shortcuts and productivity hacks, and the idea of automatic clause databases is so appealing. This was exactly the reason why we also developed such a product, only to realise that the gap between the expectations we raised and the daily results we could deliver was simply too wide.

Automatic clause databases do work really well for one particular use case: searching ("sniping") for a very specific clause in an old document, on the basis of either a combination of unique keywords, or using the client or project’s name. For such “sniping” searches, automatic clause databases will be unmatched, and will be able to present you with the right clause in a matter of seconds. In a small team of experienced lawyers that know each other's drafting style and past projects, and can correctly assess the legal merits of isolated clauses, this can work quite well.

However, when we asked our test users to keep track of the number of true sniping occasions, they reported an average of about once every three weeks, while we had personally expected this number to be considerably higher. Only one lawyer reported an average of more than once per week, but this lawyer remarked that she actually did not need an automatic clause database for it. And she was probably right: if you remember the name of the client/project and some unique keyword, then most content management systems (or even the built-in search functionality of Windows and Mac) will also lead you to the right file in a few seconds.

Epilogue

So here we were, having built an automatic clause database feature (“Haystack”), only to kill it after initial user feedback.

After all the criticism, you are probably thinking that we gave up and once again focused exclusively on building software for curated clause libraries.

No. We actually went back to the drawing board, and designed Clause Hunt, an alternative spin on the idea of an automated clause database. The Clause Hunt feature is also zero-setup, also ingests thousands of existing documents, and also allows you to search for interesting clauses using keywords. However, unlike true automated clause databases, Clause Hunt does not automatically extract clauses for you, isolated from their original context. Instead, it visually presents you with the relevant matching paragraphs, so you can quickly jump between potentially interesting clauses.

This solves the splitting and missing-context problems, while allowing users to perform easy clause sniping, all with zero setup. However, we did not solve all the other problems described above, including the general warning about potential compliance issues (see our own detailed analysis, as well as the analysis by law firm Timelex).

But Clause Hunt is truly intended to be combined with a curated clause library, so that each time you find a good clause in an old document, you are nudged to store that clause in your curated library. The solution is not perfect, but unless you can wait another twenty years until artificial intelligence can truly reason about the legal content of clauses, this is the best of both current worlds.
