Vector Database

Understanding AI Training vs RAG for Business Applications

Provides a deep technical explanation of Retrieval Augmented Generation (RAG), including chunking, embeddings, and vector databases.

Transcript:

Hey, I didn’t realize how long this video was going to be, so buckle up. Today’s video is a little bit deeper. One of the things customers talk to us about when we start talking about AI is that they want to train the AI with their data. I just wanted to touch on this point because the terminology “train” is quite specific. Training an AI, or a language model, is a really expensive process. It takes tens of thousands of hours of processing time and hundreds of thousands of pieces of content, and most organizations are probably never going to train their own model using their own data. Well, for 2025 anyway; that may change in the future. So instead, what we can do is augment the language model with our own data. I just wanted to touch on that concept of training versus augmenting, and explain it using a pattern that’s quite common today. The pattern is called retrieval augmented generation, or RAG, and I’m going to do my best to explain how it works.

So let’s say you have a PDF document and you want to be able to query it using a language model in some way. In this diagram, this represents a PDF document. Now, to make it useful, that is, to convert it into something a language model can understand, we need to break it down into small sections and then convert those sections into numeric patterns that the language model is going to understand. That process is called chunking and embedding. You take a really big document and break it down into small sections, chunks, and you get control over how big those chunks are.

Then we do what’s called embedding. Embedding is where we convert each of those chunks into a set of numbers. Those numbers represent the words in each chunk, where they exist in relation to all of the known words, and, importantly, the dimensions for those words. Now, dimensions are important. Dimensions allow us to take a single word and describe it in a lot of different ways. If I take the word “apple”, it could be described as a fruit, and it could be described as green or red. But the term “Apple” also represents a company, so it could be described as technology. Different dimensions of words allow us to understand the context of those words. What the embedding is doing is taking each of the words from our chunk and representing it across a number of different dimensions.

We then take that information and store it in what’s called a vector database. So at this point, we have a document, or many documents, broken down into these embeddings. The vector database stores the original chunk of content right alongside the numeric pattern that represents the embedding. Okay, so at this point we have a lot of numbers in a database.

Let’s now look at the other side of the equation, that is, when I want to ask a question about the document, how does that work? Well, the process is actually very similar. We take the query and it’s chunked and embedded in the same way. Then we do what’s called a similarity search, which is where the embeddings from the query are compared to the embeddings we already have in our vector database. The information is then combined and a prompt is built. So looking at a slightly more specific example, the query and the chunks that were found in the vector database are combined into a prompt that’s then sent off to the AI.
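To make that end-to-end flow a little more concrete, here is a minimal Python sketch of the pipeline just described: chunk a document, embed the chunks, store them alongside their embeddings, embed the query, run a similarity search, and assemble a prompt. The embed() function below is a toy stand-in I’ve invented (a hashed bag-of-words), not a real embedding model, and a plain Python list stands in for the vector database; a real system would call an actual embedding model and a dedicated vector store.

```python
# Minimal RAG sketch: chunk -> embed -> store -> embed query -> similarity search -> build prompt.
# embed() is a toy stand-in (hashed bag-of-words), not a real embedding model.
import math
from hashlib import md5

DIMENSIONS = 16  # real embedding models typically use ~1,000-1,500 dimensions


def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Fixed-size chunking with overlap, as described above."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def embed(text: str) -> list[float]:
    """Toy embedding: count each word into one of DIMENSIONS buckets, then normalize."""
    vec = [0.0] * DIMENSIONS
    for word in text.lower().split():
        bucket = int(md5(word.encode()).hexdigest(), 16) % DIMENSIONS
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity (vectors are already normalized, so a dot product suffices)."""
    return sum(x * y for x, y in zip(a, b))


# "Vector database": each row keeps the original chunk right alongside its embedding.
document = ("Mars, often called the Red Planet, has captured human imagination "
            "for centuries. Earth is our home planet and orbits the Sun.")
vector_db = [{"chunk": c, "embedding": embed(c)} for c in chunk(document)]

# Query side: embed the query the same way, then run a similarity search.
query = "What is the red planet?"
query_embedding = embed(query)
results = sorted(vector_db,
                 key=lambda row: similarity(query_embedding, row["embedding"]),
                 reverse=True)[:3]

# Assemble the prompt from the query plus the retrieved chunks.
context = "\n".join(row["chunk"] for row in results)
prompt = (
    "You're a helpful assistant that answers questions based on the provided context.\n"
    f"Context:\n{context}\n\n"
    f"Query: {query}\n"
    "Answer the user's query based on this context. Be concise and accurate."
)
print(prompt)
```

The printed prompt is what would then be sent off to the language model, exactly as in the diagram.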
Now let me show you a more specific example. What I’ve done here is built a visualization tool that lets us play with different chunking techniques so we can physically see what’s going on, and lets us visualize how those chunks are broken down into the different embeddings and dimensions. The piece of text we’re using is this here; it’s a paragraph that’s just been grabbed. At the top we have the different types of chunking we can use, and I’m just going to use fixed-size chunks for the minute. What it’s saying is, basically, we’re going to use 50-character chunks, and that breaks this content down into each of these different chunks. But we can play with the different sizes. For example, we can increase the size of the chunks, and you can see that by doing so we get fewer chunks.

Notice that when we look at this, say for example here, “Earth is our home planet”, the word “planet” is split across two different chunks. That’s a problem, because if we’re looking for the word “planet” in our search, this won’t come up. So we also have this concept of overlap. Overlap is where chunks are created so that the tail of one chunk and the beginning of the next chunk share the same text. You can see that the first chunk ends partway through “orbit” and the second chunk starts again from that same overlapping text, so the word appears in full. This fixes the problem where words would otherwise get broken up across chunks. So that visualizes how the chunking works.

We then have what’s called the embedding. The embedding is where each of those chunks is broken down numerically into different dimensions. Now, in this case I’ve actually labeled these dimensions. In reality, there are generally around 1,000 to 1,500 dimensions managed inside the language model, sometimes more, and the dimensions really don’t have names, but I’ve added names for the sake of clarity. So what this is saying is, if we look at the position/order dimension, you can see that the first chunk scores pretty poorly, but this chunk here scores highly and this one scores very high. This scoring mechanism is where each of the chunks is measured against all of these dimensions.

Okay, so now we’ve got a bunch of numbers in our chunks. Let’s use a sample query: “What’s the red planet?” The query is also embedded, and you can see it’s scored against the same dimensions. A search is then done where we match the scores against each of those dimensions. You can see here that chunk number eight has been scored with 92% similarity, because across a series of these dimensions the query matches the content. Chunk 12 scored well, and chunk 9 scored well. So what we’ve now got is three chunks that have been identified in relation to the query “what is the red planet?”

So what happens now? How do I actually get my answer? Because the answer isn’t simply the raw chunk “Mars, often called the red planet, has captured…”. What we now do is take the original query and the chunks of content we’ve found, and assemble them into a prompt: “You’re a helpful assistant that answers questions based on the provided context. Here is the context”, so here are my search results, “and here’s the query: What’s the red planet? Answer the user’s query based on this context; be concise and accurate.” And this is how we use retrieval augmented generation to be able to search through our own content.
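The similarity-search step in the visualization can be sketched in a few lines too. Here the dimension names, the chunks, and all of the scores are illustrative values I’ve made up to mirror the demo (a real model uses roughly 1,000 to 1,500 unnamed dimensions), and cosine similarity is one common way this kind of comparison is done.

```python
# Sketch of the similarity-search step: a few chunks scored against hand-labelled
# dimensions, then ranked by cosine similarity against the query's scores.
# All dimension names and scores below are illustrative, not from a real model.
import math

dimensions = ["astronomy", "colour", "position/order", "life", "technology"]

chunk_vectors = {
    "Mars, often called the Red Planet, has captured...": [0.9, 0.8, 0.6, 0.1, 0.0],
    "Earth is our home planet and orbits the Sun...":      [0.8, 0.1, 0.7, 0.9, 0.0],
    "Apple released a new phone this year...":             [0.0, 0.2, 0.1, 0.0, 0.9],
}

query_vector = [0.9, 0.9, 0.5, 0.0, 0.0]  # "embedding" of "What is the red planet?"


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Rank every chunk by how closely its scores match the query's scores.
for text, vec in sorted(chunk_vectors.items(),
                        key=lambda item: cosine(query_vector, item[1]),
                        reverse=True):
    print(f"{cosine(query_vector, vec):.0%}  {text}")
```

Running this prints the Mars chunk first with the highest similarity, analogous to chunk eight scoring 92% in the demo; the retrieved chunks would then be combined with the query into the prompt shown above.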
I recognize that this is a relatively complex process, but I wanted to do my best to explain how it works. So the next time someone asks “can we train our own model?”, the answer is probably no, and instead you might want to use retrieval augmented generation.