Before we dive into building the system, it’s crucial to ensure that we have all the necessary tools and knowledge. Here’s a list of prerequisites:
First, we need to install the necessary Python packages. These provide the tools and functions we’ll use throughout this project. To install them, open your terminal and run the command below:
```shell
pip install openai langchain pinecone-client unstructured
```
At this stage, we’re going to use the Unstructured PDF Loader and the Recursive Character Text Splitter. The former loads our book’s PDF, and the latter splits the text into manageable chunks, which makes the content easier to handle and process.
```python
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF
loader = UnstructuredPDFLoader('the_field_guide_to_data_science.pdf')
documents = loader.load()

# Split the text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
```
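To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a minimal, hypothetical sketch of fixed-width chunking in plain Python. The real `RecursiveCharacterTextSplitter` is smarter — it prefers to split on paragraph and sentence boundaries before falling back to raw character counts — but the overlap behaviour is the same idea:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-width chunking with overlap (illustration only)."""
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])  # [1000, 1000, 900]
```

The 200-character overlap means each chunk repeats the tail of the previous one, so a sentence that straddles a chunk boundary is still fully contained in at least one chunk.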
For this step, we’ll use OpenAI’s powerful models to convert the text chunks into numerical vector representations—which are more commonly referred to as embeddings.
```python
import openai
from langchain.embeddings import OpenAIEmbeddings

# Set your OpenAI API key
openai.api_key = 'YOUR_OPENAI_API_KEY'

# OpenAIEmbeddings reads the key from its own argument (or the
# OPENAI_API_KEY environment variable), so pass it explicitly
embeddings = OpenAIEmbeddings(openai_api_key='YOUR_OPENAI_API_KEY')
```
In the code snippet above, be sure to replace ‘YOUR_OPENAI_API_KEY’ with an actual API key from OpenAI.
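Why bother turning text into vectors at all? Because vectors make “semantic closeness” measurable: chunks about similar topics end up near each other in the vector space, and nearness is typically scored with cosine similarity. Here is a toy sketch of that comparison using made-up 3-dimensional vectors (real OpenAI embeddings have 1,536 dimensions, but the math is identical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]          # pretend: "data science skills"
chunk_about_stats = [0.8, 0.2, 0.1]  # pretend: a chunk on statistics
chunk_about_cooking = [0.0, 0.1, 0.9]  # pretend: an unrelated chunk

print(cosine_similarity(query_vec, chunk_about_stats) >
      cosine_similarity(query_vec, chunk_about_cooking))  # True
```

This is exactly the comparison Pinecone performs for us at scale in the next step, across thousands of stored vectors.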
The next step uses Pinecone, a managed database that specializes in storing and searching vector data. We will store our embeddings in Pinecone, which lets us perform efficient similarity searches later.
```python
import pinecone

# Initialize Pinecone
pinecone.init(api_key='YOUR_PINECONE_API_KEY', environment='YOUR_PINECONE_ENVIRONMENT')

# Create an index (1536 matches the dimension of OpenAI's text embeddings)
index_name = 'data-science-book'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)

# Connect to the index
index = pinecone.Index(index_name)
```
In the code above, replace ‘YOUR_PINECONE_API_KEY’ and ‘YOUR_PINECONE_ENVIRONMENT’ with your actual Pinecone API key and environment.
Next, we will compute an embedding for each chunk of text and upload the results to Pinecone, so that we can later run efficient similarity-based queries.
```python
from tqdm.auto import tqdm

# Compute an embedding for each chunk and upsert it into Pinecone
for i, doc in enumerate(tqdm(texts)):
    # Compute embedding
    embedding = embeddings.embed_query(doc.page_content)
    # Create a unique ID (renamed from `id` to avoid shadowing the built-in)
    doc_id = f'doc-{i}'
    # Upsert into Pinecone, storing the raw text as metadata
    index.upsert([(doc_id, embedding, {'text': doc.page_content})])
```
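The loop above upserts one vector per network call, which gets slow for a long book. Pinecone also accepts a list of vectors per `upsert` call, so a common optimization is to send them in batches (Pinecone’s guidance has typically been on the order of 100 vectors per batch). A small, hypothetical batching helper:

```python
def batched(items, batch_size=100):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage sketch: for batch in batched(vectors): index.upsert(batch)
print([len(b) for b in batched(list(range(250)))])  # [100, 100, 50]
```

Batching cuts the number of round trips from one per chunk to one per hundred chunks, which matters once a PDF yields thousands of chunks.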
At this stage, we’re about to construct a function that will take in user queries, utilize Pinecone to find the most similar text chunks (based on the embeddings), and then use OpenAI to generate comprehensive answers from these related text fragments.
```python
def answer_query(query):
    # Embed the query
    query_embedding = embeddings.embed_query(query)

    # Search Pinecone for the most similar text chunks
    result = index.query(query_embedding, top_k=5, include_metadata=True)

    # Extract the relevant texts
    contexts = [match['metadata']['text'] for match in result['matches']]

    # Build the prompt, separating contexts so they don't run together
    prompt = (
        "\n\n---\n\n".join(contexts)
        + f"\n\nUsing the information above, answer the following question:\n{query}"
    )

    # Let OpenAI generate the answer
    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        max_tokens=200,
        temperature=0.7,
    )
    return response.choices[0].text.strip()
```
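One practical caveat: text-davinci-003 has a context window of roughly 4,000 tokens, so five 1,000-character chunks plus the question usually fit, but larger `chunk_size` or `top_k` values can overflow it. A rough, hypothetical guard drops the lowest-ranked contexts once a character budget is exceeded (about 4 characters per token is a common heuristic for English text):

```python
def fit_contexts(contexts, max_chars=8000):
    """Keep the highest-ranked contexts that fit within max_chars total."""
    kept, total = [], 0
    for ctx in contexts:  # Pinecone returns matches ranked by similarity
        if total + len(ctx) > max_chars:
            break
        kept.append(ctx)
        total += len(ctx)
    return kept

print(len(fit_contexts(["a" * 3000] * 5)))  # 2: only two 3,000-char contexts fit
```

Because the matches arrive sorted by similarity, truncating from the tail discards the least relevant context first. For precise budgeting you would count actual tokens (e.g. with a tokenizer) rather than characters.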
With everything set up, it’s time to test our system by asking it a question related to our book’s content. This lets us assess its performance and verify that it’s working as expected.
```python
query = "What are the key qualities of a successful data scientist?"
answer = answer_query(query)
print("Question:", query)
print("Answer:", answer)
```
Once you’re comfortable with the process of querying large documents using AI, you might find these additional resources helpful for further exploring the possibilities: