Large Language Models (LLMs) and generative AI such as ChatGPT and GPT4 are the new hype.

Even though the legal world usually takes a conservative "wait and see" approach, many law firms and inhouse legal teams have already jumped on the LLM-train, as we can see from the number of signups we have witnessed ourselves since releasing our AI-driven full-document generation module.

When we talk to law firms about implementing this technology, one question consistently pops up: “Is it possible to customise GPT with our own documents, i.e. to feed our own legal knowledge into GPT?”

This is a very legitimate question. Even though GPT4 is able to pass the US bar exam with flying colours and can draft very good contractual clauses, its knowledge of local or specialised legislation is limited, at least compared to its incredible language skills. Also, as everyone knows, GPT's training data stops in 2021, meaning GPT is unaware of any newer information.

Where does GPT's legal knowledge come from?

GPT version 3 incorporates data found in a combination of books, Wikipedia, and a cleaned-up version of "Common Crawl", a collection of content from publicly available websites gathered by a non-profit organisation. Roughly half of this information is in English.

Other LLMs include a variety of legal sources, such as European legislation (EUR-Lex), discussions in the EU parliament, case law from the European Court of Justice and the European Court of Human Rights, as well as US court cases and US contracts such as those in the SEC's EDGAR database (also easily accessible through ClauseBuddy).

The training sources of GPT4 are not known, because — unlike what its creator's name ("OpenAI") would suggest — there is limited transparency about GPT4. In the report published at its release, it is even mentioned that for competitive and safety reasons, no details are provided about the data sources and training method.

Local, specific and non-English legislation is underrepresented in these sources, so it should not come as a surprise that LLMs often produce suboptimal results when factual data from such sources is required. The same applies to case law, because it really depends on your jurisdiction whether case law is actually openly available, despite many governments' efforts towards "open data". Legal doctrine is in an even worse position: in terms of volume, most publicly available legal doctrine consists of diverse sources such as blogs and newsletters.

While there is an ongoing technical debate about whether new LLMs should aim for more data or instead more intensive training on existing data, the general assumption is that more data will lead to higher-quality LLMs. However, despite the terabytes of information available, there is some fear that the developers of LLMs will run out of additional high-quality sources of information as early as 2026. While terabytes of publicly available low-quality information (such as discussions on various online forums) will continue to be produced every day, additional sources of high-quality information are much harder to come by once you have exhausted sources such as Wikipedia.

This seems all the more true for legal knowledge, which is highly fragmented across jurisdictions, languages and specialised legal domains. Let's also not forget that the number of online legal discussion forums is dwarfed by the number of publicly available discussions in other domains, such as politics, sports and even software development. And while law firms happily distribute countless blogs and newsletters about legal topics, most of those texts tend to be deliberately superficial.

One can therefore assume that most of the truly high-quality legal knowledge is stored behind the digital bars of law firms and legal teams, within their contracts, memos and legal briefs.

Wouldn't it be great if you could train an LLM to ingest all that existing knowledge in those documents, so as to create your own "LawGPT"?

To answer that question, we have to look at five different methods to teach an LLM new tricks.

Method 1: Directly submitting new information

The easiest method is to directly include new information in the prompts you submit. Such "few-shot learning" works very well, because LLMs have become quite good at dealing with human language, and will therefore "understand" the examples or new information you feed them. For example, if you paste the text of some statute into the prompt and ask a few questions about it, then chances are high that an advanced LLM such as GPT4 will be able to provide a correct answer, based on its "understanding" of the text it was provided with. This way, lawyers can indeed "teach" an LLM about legal evolutions.
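
To make this concrete, here is a minimal sketch of that approach, assuming the OpenAI Python library (as it existed around the GPT4 release) and a valid API key; the statute text and the question are placeholders, not real content.

```python
# Minimal sketch of "teaching" GPT by pasting new information directly into the prompt.
# Assumes the OpenAI Python library (pip install openai) with an API key set via
# openai.api_key; the statute text and the question below are placeholders.
import openai

statute_text = "Article 1. ..."  # paste the text of the new statute here

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a legal assistant. Answer only on the basis of the statute provided."},
        {"role": "user",
         "content": f"Statute:\n{statute_text}\n\nQuestion: Does this statute apply to employment contracts?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```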

The first downside to few-shot learning is mainly practical: LLMs do not have any "memory", so with each new question you ask them, they will be clueless about previous questions, and will therefore have to re-learn everything. This is annoying, but end-user software can overcome this hurdle for you, by automatically repeating previous information in your interactions with the LLM. (In fact, ChatGPT gives the impression that it can remember information throughout a chat session, but technically speaking it will actually re-read the entire conversation, including its own answers, when you ask it a new question.)

The biggest downside to directly submitting new information is that prompts are limited in size. GPT3 and ChatGPT have a maximum of about three thousand words (roughly 7 pages), while the standard version of GPT4 goes up to six thousand words (15 pages), and a special but much slower version of GPT4 goes up to 24 thousand words (60 pages).

Given the legal community's tendency to produce long texts, this is of course problematic. The US constitution does not fit within GPT4's standard prompt if you include all its amendments, while the EU's GDPR is roughly twice the maximum size the special version of GPT4 can handle. And that is before counting your actual question, which also counts towards the maximum number of words allowed in a prompt.
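
For readers who want to check whether a given document fits, here is a small sketch using OpenAI's tiktoken tokeniser; the 8,192-token limit used below corresponds to the standard GPT4 model at the time of writing, and the document text is a placeholder.

```python
# Rough sketch of checking whether a text fits within a model's prompt window.
# Uses OpenAI's tiktoken tokeniser (pip install tiktoken); the 8,192-token limit
# corresponds to the standard GPT4 model, and the document text is a placeholder.
import tiktoken

def fits_in_prompt(text: str, model: str = "gpt-4", limit: int = 8192) -> bool:
    encoding = tiktoken.encoding_for_model(model)
    n_tokens = len(encoding.encode(text))
    print(f"{n_tokens} tokens (limit: {limit})")
    return n_tokens <= limit

document = "We the People of the United States, in Order to form a more perfect Union, ..."
print(fits_in_prompt(document))
```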

Method 2: Training a new LLM

It should therefore not come as a surprise that many want to create and train their own LLM. Is it really so difficult to have software process the gigabytes of data from a law firm?

As every data scientist will tell you, the problem is not so much in the core algorithms and software packages — those are well-known and readily available, even for free as open source software. Training a new LLM is really an engineering problem that requires massive computing power.

Essentially, these algorithms consist of the software playing billions of "guessing games" with itself, where one version of the software deliberately masks a few words in a sentence, which another version of the software then has to guess. With every new guess, the software adapts its internal parameters and takes another guess, so as to eventually become a very good text guesser that can generate (guess) new texts.
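
As a purely illustrative toy — real LLMs adjust billions of neural-network parameters rather than simple word counts — the loop below conveys the basic idea of masking a word, guessing it from its context, and adjusting internal statistics after each round; the example sentences are invented.

```python
# Toy illustration of the "guessing game" (not an actual neural network): hide a word,
# guess it from the preceding word, and update the internal statistics after each round.
from collections import defaultdict, Counter
import random

sentences = [
    "the supplier shall indemnify the customer",
    "the customer shall pay the supplier",
    "the supplier shall deliver the goods",
]

parameters = defaultdict(Counter)  # stand-in for model parameters: counts of "next word" per preceding word
correct_guesses = 0

for _ in range(1000):
    words = random.choice(sentences).split()
    position = random.randrange(1, len(words))        # pick a word to mask
    context, masked_word = words[position - 1], words[position]
    best_guess = parameters[context].most_common(1)   # guess the masked word from its context
    if best_guess and best_guess[0][0] == masked_word:
        correct_guesses += 1
    parameters[context][masked_word] += 1             # adjust the "parameters" after seeing the answer

print(f"correct guesses: {correct_guesses} out of 1000")
print(parameters["shall"].most_common(3))             # what the toy model now expects after "shall"
```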

All these billions of guesses require enormous processing power. It is not a coincidence that OpenAI received significant investments from Microsoft, in the form of a supercomputer running in Microsoft's cloud environment. Even with such a supercomputer, it takes months to train an LLM such as GPT3 or GPT4.

Of course, unlike OpenAI, law firms are not in the business of selling LLMs to other customers, so it is perhaps tempting to assume that a single law firm can get by with much less training. The problem is that understanding language and basic facts about the world (e.g., "a shirt fits within a suitcase, but not the other way around") really requires a lot of training, so limiting the training to just legal texts will result in software with limited reasoning and language skills.

Moreover, even training on just a few gigabytes of text requires significant computing power, which has to be temporarily rented from third parties. You will need special servers with dedicated processors (so-called "GPUs") that can handle the massive guessing games, and these easily cost several thousands of euros/dollars per month.

Aside from the cost of computing resources, there's also the required technical knowledge. Software developers who know something about text mining have been hard to find in the last couple of years, with salaries that often mimic those of Big Law lawyers. Within that limited pool of software developers, those who have practical engineering experience with training a new LLM are truly rare.

As a result, there is enormous risk involved in this kind of project. In other words, unless you are truly adventurous and have a large risk appetite, forget about developing your own Harvey. There's a reason they have a waiting list and a very close relationship with OpenAI.

Method 3: Finetuning an existing LLM

Fortunately, there is a third option: so-called "finetuning". It is a light training process, where you submit several pairs of prompts (e.g. questions) and proposed answers to the LLM.

The LLM will then take that information, play the guessing games with that information for several hours, and then create a "fine-tuned version" of itself. An LLM such as Cohere's large models or OpenAI's GPT3 (ChatGPT and GPT4 cannot yet be used for this) could thus serve as a foundation model: you start with the existing language skills and basic world knowledge, and then add your own specific legal knowledge — e.g., a few hundred recent legal memos about German finance law.

Finetuning requires no special hardware or hardware engineering skills, because you use the existing computing infrastructure of the LLM provider (e.g., OpenAI, Microsoft, Cohere or Amazon). Furthermore, while the finetuning process is not free of charge — if you use Microsoft to finetune GPT3, you pay about 77 EUR / 85 USD per hour — finetuning generally takes several hours or days, so the training cost should remain manageable for a mid-size or larger law firm.

Even so, depending on the provider you choose, you may also have to add additional storage costs for storing your finetuned model. Referring again to the example of Microsoft, these storage costs can easily reach 1,500 EUR per year, per finetuned model. When you also take into account that you will probably have to finetune a model per jurisdiction and legal department, and that finetuning often has to be completely repeated from scratch when you add additional information, you easily arrive at a resource cost of more than 5,000 EUR per year per legal domain, per jurisdiction.

This is not exactly cheap, but seems doable, right?

Wrong. Everyone jumps on finetuning as the holy grail of custom LLMs, but the reality is that finetuning remains a bit of a mystery, with many more failures than successes. If you visit OpenAI's user discussion forums, you will read countless frustrations from users who have given up after many tries and costly experiments. Many users want to achieve an outcome that is, from a certain distance, similar to what a law firm or inhouse legal department would like to achieve. But despite many tries, these users' results seem significantly worse than what the standard version of GPT3 (let alone GPT4) generates.

OpenAI itself is quite ambiguous about the process. For example, its technical manual provides several case studies where finetuning was successfully used to increase the quality of GPT's output. Conversely, its more practical "cookbook" warns that, although finetuning can feel like the more natural option to add knowledge, it is generally not recommended as a way to teach the model knowledge. Instead, finetuning is better suited to teaching specialised tasks, and is less reliable for factual recall. Applied to legal tasks, finetuning is for example really good for extracting facts (e.g. by providing questions such as "Which party names are mentioned in this paragraph?" or "What is the total current duration of this renewable contract?", with correct sample answers that the LLM can learn from). In such a context, you do not teach new information to the LLM; instead, you teach the LLM how to properly deal with certain tasks for which it should already have sufficient information.

Due to the technical setup of finetuning, you have to split your knowledge into prompts and answers. Even when you think about prompts as "questions", this format is not a very natural fit for most information in the legal world, where we usually draft contracts, memos, legal opinions and legal briefs, which consist of entire documents. Those full documents would then have to be split and converted into question/answer pairs. This is doable for some types of legal memos (e.g., "What is the notice period to be respected by the employer if ...?") and individual contract clauses (e.g., "Liability clause with a maximum exposure of ...") but problematic for documents where there is a certain logic connecting the different sections.
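
To illustrate what this splitting looks like in practice, here is a small sketch that writes such question/answer pairs in the JSONL format used by OpenAI's (GPT3-era) finetuning API; the two example pairs are invented, and the separator conventions are only one common way of structuring the data.

```python
# Sketch of converting legal knowledge into the prompt/completion pairs expected by
# OpenAI's (GPT3-era) finetuning API: one JSON object per line of a .jsonl file.
# The two example pairs are invented, and the "###" / "END" markers are just one
# common convention for separating prompts from completions.
import json

training_pairs = [
    {
        "prompt": "What notice period applies when dismissing an employee with five years of service?\n\n###\n\n",
        "completion": " Under [jurisdiction], the notice period is [X] weeks. END",
    },
    {
        "prompt": "Draft a liability clause with a cap of EUR 100,000.\n\n###\n\n",
        "completion": " The total aggregate liability of either party shall not exceed EUR 100,000. END",
    },
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for pair in training_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```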

Then there's another problem with finetuning. A common misunderstanding is that finetuning is a bit like adding a traditional lookup-database to an LLM, but the reality is completely different. In a traditional database, when you search for information, there will be a precise lookup and exact reproduction of that information; if you perform a search query today, it will give you exactly the same information tomorrow.  

Conversely, when an LLM is trained (or finetuned) on a certain text, it will not store the exact details of that text. Instead, it will store small fragments of text, together with millions of mathematical connections to other fragments of text. When you ask an LLM to retrieve certain information, it will perform an ad hoc reconstruction of that information — often resulting in a slightly different outcome every time you ask the question. LLMs are therefore much closer to human brains than to traditional databases; for example, when you ask me to recall when I visited New York, I may "reconstruct" that knowledge by linking my visit to New York to the year I got my new job or new car, and depending on when or where you ask me the question, I may give a slightly different answer.

This "black box" nature of LLMs is also the reason why it is not possible to easily add new information (e.g., knowledge about some new legislation, or some additional memos you've written) to an LLM. In most cases, you have to perform the costly and time-consuming finetuning process all over again.

Given this background, it should not come as a surprise that LLMs will "confabulate" or "hallucinate" facts: they are simply not the best tools for retrieving factual knowledge. Instead, you should use them for what they are excellent at — understanding and manipulating text.

Method 4: Semantic Search

Which actually brings us back to the first option: inputting new information via prompts. Currently, the best option for law firms seems to consist of the combination of an advanced database with the language capabilities of an LLM.

The idea is that you first filter all your available information down to a limited number of items that fit within the prompt of an LLM. Then you ask the LLM to read those items and answer your question.

At the outset, the use of a "normal" database immediately provides you with several benefits compared to LLMs, starting with speed and access rights. Storing and retrieving information in a normal database is cheap and fast, and you can easily specify access rights, so that certain items can only be retrieved by certain members of a team. Moreover, and not to be underestimated, normal databases are very easy to update: unlike LLMs, no complete retraining or re-finetuning is necessary each time you would like to add additional information.

For example, suppose you would like to compose a legal memo about relying on the "legitimate interests" legal ground for processing personal data under the GDPR. As a first step, you would consult your internal database of legal memos to find relevant pieces of text that you previously stored there and that talk about "legitimate interests". Out of the (potentially hundreds of thousands of) fragments of text stored in your legal memos database, the database would, for example, narrow the results down to 20 items. Those 20 items would then be individually submitted to the LLM, asking the LLM to assess whether each item is relevant for the question at hand. (In light of an LLM's relative slowness, there is a good reason why you really want to narrow down the results as much as possible, otherwise this process would take forever — not to mention the LLM usage costs.)
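
A minimal sketch of that "narrow down, then ask" step could look as follows, assuming the OpenAI Python library; the search_memo_database function is a stand-in for your own document database and simply returns hard-coded fragments here.

```python
# Sketch of the "narrow down, then ask" pattern: a conventional database query first
# reduces the candidates, after which the LLM judges the relevance of each fragment.
# Assumes the OpenAI Python library; search_memo_database stands in for your own database.
import openai

def search_memo_database(keywords: str, limit: int = 20) -> list[str]:
    # Stand-in for your real document database.
    fragments = [
        "Legitimate interests may serve as a legal ground for direct marketing, provided ...",
        "The notice period for terminating the lease is three months.",
    ]
    return fragments[:limit]

def is_relevant(fragment: str, question: str) -> bool:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nFragment:\n{fragment}\n\n"
                       "Answer YES if this fragment helps to answer the question, otherwise answer NO.",
        }],
    )
    return response["choices"][0]["message"]["content"].strip().upper().startswith("YES")

question = "Can we rely on legitimate interests for direct marketing under the GDPR?"
candidates = search_memo_database("legitimate interests")
relevant_fragments = [fragment for fragment in candidates if is_relevant(fragment, question)]
```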

Your drafting software and LLM will then engage in a kind of ping-pong session. The drafting software will increasingly narrow down the relevant items, and then at some point instruct the LLM to draft a new text on the basis of the relevant items.

Three observations emerge from this process.

First, the human will need to stay in the loop. As the recent experiments with Auto-GPT and BabyAGI show (where software packages intensively communicate with an LLM), humans really need to give some direction in these automated processes between independent software agents, otherwise the results will very quickly go off track, because the LLM will have chosen unsuitable directions.

Second, this kind of setup is still much too experimental, costly and slow to be actually useful. In our own experiments here at ClauseBase, we have been combining our drafting engine with GPT4 in this ping-pong fashion, but the results are far from usable, and in fact much worse than what you can achieve in a more human-driven process (what we call "hybrid drafting", where you deliberately decide which parts to include from your own knowledge bank, and which parts to have drafted by GPT). And that's probably a very good thing, because we are not convinced that such fully automatic processes are suitable for the legal world.

Third, you really need to store your information in a good database, and preferably one that is as advanced as possible, allowing legal experts (and, in the future, independently acting software) to look up information quickly and effectively. The era of chaotically organising your information is now truly coming to an end.

Even traditional databases (such as SharePoint) may be getting too limited. One interesting improvement is the use of a semantic search engine, which will also find information that is semantically close to your search query, instead of searching literally for keywords. The way this works technically is that individual terms are converted into numbers that represent associations between words, kind of like a "cloud" of words.

Even if the database has no prior knowledge of a certain language, it will be able to determine that words such as "liability" and "damages" are related to each other, because it will notice that, across thousands of example documents, these words tend to be used close to each other.

If this sounds similar to how an LLM stores information, then you are right. In fact, LLMs can be of great help here, because you can submit a piece of text to an LLM, and it will give you back the numbers (a "vector") associated with that text, which you can then store in your semantic database. Later on, when a user searches the database, you ask the LLM to convert the user's search query into a vector. Your semantic database can then easily look up text fragments that are semantically close to that vector and give back the closest results.
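
As a minimal sketch — assuming OpenAI's embeddings endpoint and two placeholder fragments — the lookup could work roughly like this:

```python
# Minimal sketch of semantic search: every fragment is converted into a vector once,
# stored, and compared against the vector of the search query using cosine similarity.
# Assumes the OpenAI Python library and its embeddings endpoint; the fragments are placeholders.
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

fragments = [
    "The supplier's liability is capped at the fees paid during the preceding 12 months.",
    "This agreement is governed by Belgian law.",
]
stored = [(fragment, embed(fragment)) for fragment in fragments]  # done once, kept in your database

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vector = embed("limitation of liability")
best_match = max(stored, key=lambda item: cosine_similarity(query_vector, item[1]))
print(best_match[0])  # the semantically closest fragment
```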

Method 5: LLM plugins

A last possibility is to make your internal legal database available as a "plugin" for an LLM. In this case, too, there will be a ping-pong session between your internal database and the LLM, but this time reversed, as the LLM will be the ping-pong master.

The idea is that you can actually teach an LLM that, for a few specific areas, it should get its information from a third-party source, instead of generating text or finding factual knowledge itself. OpenAI has published a specification in this regard for ChatGPT, which for example allows ChatGPT to rely on the Wolfram engine to perform advanced mathematical calculations. Each time ChatGPT concludes that it needs such a calculation, it will reach out to Wolfram to get an answer. ChatGPT will then integrate that answer and continue generating the rest of the text.

Granting LLMs access to your own internal legal database is obviously the next step. In the technical specification of the plugin, you tell the LLM that if it needs factual information about certain legislation, it first needs to query your internal legal database for that information, for example using a set of keywords or the "vectors" (numbers) described above.

In this entire process, the LLM is in the driving seat: using the capability specifications of your plugin, it will decide whether or not to actually reach out to your internal database. This is a much simpler design than having a drafting engine or human steer this process (essentially you only have to tell the LLM how your internal database should be accessed and queried), but obviously you lose quite some control over the process.
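
As a rough sketch of the database side of such a setup — the plugin manifest and the OpenAPI description that tell the LLM when and how to call the endpoint are omitted here — an internal search endpoint could look like this, using Flask and a stand-in search function:

```python
# Rough sketch of the endpoint an LLM plugin could call to query your internal database.
# A real plugin additionally needs a manifest and an OpenAPI description telling the LLM
# when and how to call this endpoint; Flask and the stub search function are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

def search_memo_database(query: str, limit: int = 5) -> list[str]:
    # Stand-in for your real document database.
    return [f"Fragment about '{query}' from an internal memo."][:limit]

@app.get("/search")
def search():
    query = request.args.get("query", "")
    return jsonify({"query": query, "results": search_memo_database(query)})

if __name__ == "__main__":
    app.run(port=5000)
```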

Conclusion

We are probably in the middle of the Gartner hype cycle, with increasingly inflated expectations of what the new technology is capable of.

LLMs are extremely good at manipulating legal texts, and that is what you should primarily be using them for. Even though there are many examples online where GPT4 falls short, the reality is that it is very suitable for handling legal tasks under the supervision of a knowledgeable legal expert. Law is a profession of words, and GPT4 is simply really good with words.

However, the technology has its limitations, and dealing with legal knowledge is one of them, particularly for specialised legal subject areas and small jurisdictions. Adequately feeding information to GPT4 is therefore the holy grail, but there is no silver bullet yet. "Finetuning" an LLM is certainly not that silver bullet, although it has its merits for some use cases.

Based on the interactions we have with law firms, it seems that many are claiming to be already performing such "finetuning" on their knowledge. In reality, however, we have strong suspicions that within the legal world, finetuning is currently still a bit like teenage sex, which everybody is claiming to do but very few actually do. Indeed, as soon as our conversations with legal innovation managers go deeper than the surface, we notice that there really is a lot of fear, uncertainty and doubt.

What is clear, however, is the interesting paradox that despite the promise of being able to automate nearly everything, the curation and organisation of legal knowledge is more important than ever. It does not matter whether you take the avenue of training, finetuning or prompting (with or without semantic search): garbage in means garbage out, and simply throwing all your old stuff on a giant heap is not going to help you. Legal texts will be generated at a much higher pace than in the past, but this also means that errors in your databases will do more damage than in the traditional drafting process, where words were more carefully selected and facts more thoroughly checked.

The technology is of course moving very fast, so you can expect that at least some of the limitations described in this article will have been solved by next year. However, any legal expert who has seen a demo of what GPT4 is capable of will be aware that a profound shift is taking place in the legal drafting space. So far, knowledge management has been a very weak area for most legal teams, but now is the time to start acting. Your clients and the new generation of lawyers will not wait for you.

Want to get more information? Join our practical webinar on the 9th of May. Or get in touch with us to discuss the tools we offer to build your own advanced legal database, and be prepared for the evolutions of tomorrow. You can also already use our hybrid drafting tools today, and combine your own legal content with the output from LLMs, thanks to the advanced layout engines of Clause9 and ClauseBuddy.