
What’s Up With Open Source LLMs?


    By now, we seem to all have accepted that generative AI will drastically change everything. What remains unclear, however, is if OpenAI’s latest release will still seem like the second coming of Christ by the time we reach GPT v1000. Will foundation models become commoditized across a few key providers? Or will most companies be running their own in-house models in the next few years?

    There are substantial arguments to be made for any of these outcomes. With each new release, GPT seems to get better in ways we hadn’t even thought of. The fact that GPT4 now performs better than 90% of humans on the LSAT feels both amazing and disturbing. But we have also started to see open-source projects nipping at OpenAI’s heels in what some have called LLMs’ “Stable Diffusion Moment.”

    There are obvious benefits to using an open source model. Privacy and security, affordability, customization, and avoiding lock-in are major considerations for enterprises and areas where open source stands to win. Contingent on their ability to compete reasonably on quality, these factors make open source models hard to ignore. So, to really understand and demonstrate where things stand today, we’ve decided to try several approaches and models ourselves by building an email generator – the use case that seems to resonate with everyone.

    As thoughtful venture capitalists who highly value our personal interactions with companies (and since I value my employment as an associate), we would never dream of automating our own outreach emails to founders (I promise I’m not kidding, I wake up at dawn every day to email the companies nearest and dearest to my heart, please reply to me). Instead, we’ve decided to build an email generator for founders that replies to outreach from VCs and politely tells us to piss off. After all, what am I but a humble servant to our portfolio and prospective companies?

    The task is simple. You, a founder, receive the usual email from the VC wanting to connect. You just raised, or you just want them to leave you alone right now, so you need to give them the standard “we’re heads down building” or “best to check back in a couple quarters” yadda yadda. Do these emails really require so much thought? Probably not. So, why not build a basic extension to call an LLM that can acknowledge the email, make the VC blush, and send them on their way. Lucky for you, we’ve just built one to reject ourselves.

    [Image: Email Responder]

    The Method

    We tried a number of different models and approaches, including GPT4, Vicuna, Alpaca, and a fine-tuned version of Alpaca. The primary prompt we used was the following:

    You are a helpful assistant that helps startup founders reply to Venture Capitalists that reach out to them to politely decline requests for meetings. The replies you generate should decline the meeting, notify the venture capitalist that the founder is not fundraising any time soon and that it would be best to check back down the line. Include some personalization in the response and don’t make it sound overly formal. Here are a few example responses: 

    Example 1: {…}

    Example 2: {…}

    Please generate the response to this email:

    {….}

    The examples are randomly selected at runtime from a dataset of the previous rejections I have received (and yes, I did us all a favor by weeding out the ones that have particularly hurt my feelings in the past). We do this, rather than hard-coding examples, to add some randomization in case you are unhappy with the first generation and want to rerun it.

    They usually – *cough* always – look something like this:

    Example 1: Hey X, appreciate you reaching out! We are very heads down on product at least through the end of this year. Would be happy to circle back when the timing is better on our end.

    Example 2: Hey X– I think we’re still a few quarters from thinking about fundraising, so let’s check in again in a few months.
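
    Putting this together, here is a minimal sketch of how the prompt might be assembled. The function and variable names, the in-memory dataset, and the number of examples sampled are illustrative assumptions, not the exact code behind the extension.

```python
import random

# Instructions from the prompt above, abbreviated here for readability.
SYSTEM_INSTRUCTIONS = (
    "You are a helpful assistant that helps startup founders reply to "
    "Venture Capitalists that reach out to them to politely decline "
    "requests for meetings. ... Include some personalization in the "
    "response and don't make it sound overly formal."
)

def build_prompt(incoming_email: str, rejection_dataset: list[str], n_examples: int = 2) -> str:
    """Sample a few past rejections at random and splice them into the prompt."""
    examples = random.sample(rejection_dataset, k=n_examples)
    example_block = "\n\n".join(
        f"Example {i + 1}: {text}" for i, text in enumerate(examples)
    )
    return (
        f"{SYSTEM_INSTRUCTIONS} Here are a few example responses:\n\n"
        f"{example_block}\n\n"
        f"Please generate the response to this email:\n\n{incoming_email}"
    )
```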

    GPT4 was run via the OpenAI API and the other models were run and/or tuned on a machine with 8 A100 40GB GPUs. Our fine-tuning dataset for Alpaca consisted of approximately 1,000 emails, from both my own inbox and synthetically generated responses. While 1,000 might seem paltry, it was functional for our purposes here.
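
    For reference, instruction tuning in the Alpaca style expects instruction/input/output records, so the emails had to be massaged into roughly that shape. The record below is a made-up illustration of the format, not an actual entry from my inbox.

```python
import json

# One illustrative record in the Alpaca-style instruction/input/output format.
record = {
    "instruction": (
        "Politely decline this meeting request from a venture capitalist and "
        "let them know the founder is not fundraising any time soon."
    ),
    "input": "Hi X, I'd love to grab 30 minutes to hear more about what you're building ...",
    "output": (
        "Hey X, thanks for reaching out! We're heads down on product for the "
        "foreseeable future, so let's reconnect in a few quarters."
    ),
}

# The full dataset is simply a JSON list of ~1,000 records like this one.
with open("email_finetune_data.json", "w") as f:
    json.dump([record], f, indent=2)
```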

    The Results

    **Disclaimer: in case we have misled you, a reminder that this is not actually a scientific paper. Our results were measured by our eyeballs and our eyeballs only. We know that one email is not indicative of the overall system performance and this is simply for demonstration purposes.

    In this example scenario, venture capitalist me (Marguerite) reaches out to founder me (Maggie). Founder Maggie is the CEO of the automated VC rejector tool Maggie.ai.

    [Image: Outreach email]

    GPT4

    Unsurprisingly, GPT4 is shockingly good at this. We pass a portion of the original prompt as a system message and the rest as the user. The response has a tone that is friendly, but not excessively so, and it makes subtle references to the original email without robotically repeating it.
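
    As a rough sketch of what that call looks like (using the OpenAI Python library’s ChatCompletion interface as it existed at the time of writing; the message contents and temperature here are abbreviated assumptions):

```python
import openai

openai.api_key = "YOUR_API_KEY"

system_msg = (
    "You are a helpful assistant that helps startup founders reply to "
    "Venture Capitalists ... don't make it sound overly formal."
)
user_msg = (
    "Here are a few example responses:\n\n"
    "Example 1: ...\n\nExample 2: ...\n\n"
    "Please generate the response to this email:\n\n"
    "Hi Maggie, I'd love to connect and hear more about Maggie.ai ..."
)

response = openai.ChatCompletion.create(
    model="gpt-4",  # easily swapped for "gpt-3.5-turbo"
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```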

    [Image: GPT4 response]

    Vicuna 

    In case you missed it, Vicuna is one of the open-source chatbots that was created by fine-tuning LLaMA on user-shared conversations.
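
    Running it locally looked roughly like the sketch below, assuming you have already reconstructed the Vicuna weights (they are distributed as deltas on top of LLaMA) and saved them to a local path; the path and generation parameters are assumptions, not our exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/models/vicuna-13b"  # hypothetical local path to merged weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "You are a helpful assistant that helps startup founders ..."  # full prompt from above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=256, temperature=0.7, do_sample=True
)

# Drop the echoed prompt tokens and keep only the newly generated reply.
reply = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(reply)
```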

    Performance here was also good. The personalization isn’t really at the same level, but this is definitely sendable.

    [Image: Vicuna response]

    Here is an example if you append “please make personal references to the original email” to the original prompt:

    [Image: Vicuna response with amended prompt]

    The system does as it is told, but the personalized references come across as parroted and robotic.

    Alpaca 7B

    A little rude and goes a bit rogue at the end there…

    [Image: Alpaca 7B response]

    Fine-tuned Alpaca

    It’s a bit short and still a little curt, but it gets to the point. You can also notice the subtle shift in tone. BUT this one gets an asterisk – see below.

    [Image: Fine-tuned Alpaca response]

    **The quality of responses was a bit erratic, and parts of the original prompt often showed up in the response itself. This was also an issue with the original Alpaca model, but it seemed more pronounced here. With some prompt adjustment and a larger, higher-quality dataset, however, getting this to where it needs to be seems feasible.
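
    In the meantime, a crude workaround is to strip the echoed prompt out of the generation before displaying it. The helper below is a hypothetical sketch of that cleanup, not the fix we would actually ship:

```python
def extract_reply(generated_text: str, prompt: str) -> str:
    """Remove an echoed prompt prefix and any Alpaca-style response header."""
    text = generated_text
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Alpaca-formatted models often emit a "### Response:" marker; keep what follows it.
    marker = "### Response:"
    if marker in text:
        text = text.split(marker, 1)[1]
    return text.strip()
```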

    Implications

    While this was certainly entertaining, it was also an extremely eye-opening exercise. We came away with several key takeaways:

    We are off to the races

    1. Yes, in many ways, OpenAI is really just that good, and there are two fronts where GPT continues to blow our minds. The first is quality: the ability of GPT4 to follow instructions to a tee, cover a massive breadth of use cases, and reliably output quality prose is nothing short of amazing. The second is ease of use: the barrier for a developer to get an initial GPT integration into their application is unbelievably low. POCs have become something that can be knocked out between meetings.
    2. Open source is also really freakin’ good. Are these models at the same level as GPT? No. Some might argue that they aren’t nearly as close as we make them out to be. By virtue of the fact that many are trained on the outputs of GPT itself, they in theory have to lag. But, considering Alpaca came onto the scene less than a month ago, it is remarkably impressive. Moreover, the speed of this innovation is far from the only relevant piece here. Alpaca, Vicuna, and many other open source models like GPT-2 and GPT-J are all <100 GB, and many have <10 GB versions, which are small enough to run on a standard laptop. Further optimizations like low-rank adaptation (LoRA) and quantization only continue to chip away at compute requirements (see the sketch after this list). Running state-of-the-art LLMs on personal devices (and maybe even the browser!) seems like not such a distant reality.
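
    To give a flavor of what that looks like in practice, here is a minimal sketch of loading a LLaMA-class checkpoint in 8-bit and attaching LoRA adapters with the peft library; the model path and hyperparameters are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/models/llama-7b",   # hypothetical local checkpoint
    load_in_8bit=True,    # quantize weights to 8-bit via bitsandbytes
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights are trainable
```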

    Challenges remain. So do opportunities

    1. First of all, managed infrastructure is … nice. Shocking. Getting a proper compute environment was a miserable game of begging for EC2 limit increases, making overly optimistic bets on inadequate amounts of memory, hitting insufficient-capacity errors on AWS, and spending too much money. Not to mention the deployment and serving requirements you become responsible for once you actually need these models in production. Obviously, this is probably easier if you are an actual company with a real engineering team and not a semi-washed-up developer turned VC, but I would like to believe that this is at least somewhat challenging for the broader population. In a world where running some form of your own models becomes commonplace, there will be huge demand for optimized infrastructure and managed solutions.
    2. Getting the right data and integrating it properly is key. Alpaca and Vicuna make the importance of instruction tuning clear. They also make the possibility of training customized, subdomain models a more visible reality. But if you want a small model you can run yourself for something specific (like responding to emails), you will need to get that dataset from somewhere, and the prompt-response formatting needs to be done well. Fortunately, synthetic data tools have become increasingly available (see the sketch after this list), and many enterprises already sit on enormous amounts of unstructured, proprietary data. Figuring out the best way to incorporate that data is highly dependent on the use case, and far from settled.
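
    As one illustration of the synthetic-data route, the sketch below bootstraps prompt-response pairs by asking an OpenAI model to invent outreach emails and polite rejections. The seed prompt, output file, and sample count are all assumptions, and in practice you would deduplicate and manually review the results.

```python
import json
import openai

openai.api_key = "YOUR_API_KEY"

SEED_PROMPT = (
    "Write a short, realistic cold outreach email from a venture capitalist to "
    "a startup founder, followed by a polite reply declining the meeting. "
    "Return JSON with keys 'vc_email' and 'founder_reply'."
)

records = []
for _ in range(50):  # scale up for a real dataset
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": SEED_PROMPT}],
        temperature=1.0,
    )
    try:
        records.append(json.loads(resp["choices"][0]["message"]["content"]))
    except json.JSONDecodeError:
        continue  # skip generations that aren't valid JSON

with open("synthetic_email_pairs.json", "w") as f:
    json.dump(records, f, indent=2)
```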

    Remaining questions

    A world where open source and proprietary models become commonplace requires thinking about a laundry list of additional questions. Some that are particularly front of mind for us:

    What is the optimal strategy for where information is stored and how it is retrieved?

    How much do you simply put into the prompt? Is it better to embed the information into the model itself by tuning? Or to use a vector database to decouple information retrieval from answer synthesis? When should you let chaining take the reins? For our mini project, it is very easy to include everything you need in the prompt, but if you are, say, a law firm using AI to help work on a case, you cannot pass the entirety of federal law into a prompt.
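
    For context, here is a bare-bones sketch of that vector-retrieval pattern: embed the documents once, pull back only the most relevant ones at query time, and splice just those into the prompt. A real system would use a proper vector store; the in-memory version below (OpenAI embeddings plus cosine similarity) simply illustrates the decoupling, and the document snippets are placeholders.

```python
import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"

def embed(texts: list[str]) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

documents = [
    "Statute excerpt A ...",
    "Statute excerpt B ...",
    "Statute excerpt C ...",
]
doc_vectors = embed(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# Only the retrieved context gets passed into the prompt, not the whole corpus.
context = "\n\n".join(retrieve("Which statute governs X?"))
```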

    What is the golden ratio of performance?

    Determining the ideal model and information retrieval system will require weighing tradeoffs between size, cost, latency, and performance. Comparing Google’s Bard with GPT4 is a perfect example.

    How do you measure the performance of your model and constrain it?

    The age-old question of LLM testing. If companies build their own models, they assume the responsibility of benchmarking them and making sure they don’t go off the rails.

    All of these questions are much longer discussions. The short answer is that they are all highly dependent on the use case and the various requirements of the company in question.

    Companies to watch 

    There were more open source projects and hosting platforms than we could possibly have tried ourselves. However, this project made the opportunities for disruption at both the compute and modeling layer abundantly clear. Below is a list of exciting projects and companies we are watching:

    [Image: Market map]

    What now?

    If you really feel so inclined, you can find the code for the simple GPT4 version of the extension here. This version assumes you have GPT4 access but can easily be changed to use 3.5 if needed. It also hard-codes a small set of examples in the prompt in place of the sample dataset, since I didn’t feel like publicly exposing examples of every time I get rejected.

    While I sincerely hope no one will actually be using it to reply to me, building this tool provided an extremely valuable glimpse into the current state of open-source LLMs and the requirements for running one. Yes, GPT is fantastic, but less than two months after the LLaMA weights were leaked, we are already seeing open source models produce output that would have had us jumping out of our seats a year ago. No, our mini project is not revolutionary, exhaustive, or scientifically justified, but if one hobbyist developer can get functional output from these models in a mini POC within a week, it’s hard to believe that we won’t see these models integrated into real business products in the next 12 months. The opportunities for development in this space are immense.
