Potential Technological Advancements
This section details some of the experiments that informed crucial decisions about how the Jugalbandi stack is configured. The RAG framework has been central to these experiments, given its pivotal role in increasing the reliability of the information provided by any AI solution, including Jugalbandi. The experiments were carried out using technical evaluation frameworks and with the help of legal professionals and law school volunteers.
Experimenting with RAG Evaluation Frameworks
Ragas, an open-source framework for evaluating the reliability of answers generated by a RAG pipeline, was used to test Jugalbandi. This process involved the following steps (a code sketch of the full flow appears at the end of this section):
Synthetic dataset creation: Ragas used OpenAI LLMs to generate a varied set of questions from a predefined dataset. These questions fall into one of three categories: simple questions, questions whose answers require reasoning, and questions whose answers require information from different contexts. The dataset stores the corresponding context from which each question was generated, and refers to it as the ‘ground truth’.
Jugalbandi answer generation: The questions generated in the synthetic dataset are passed to Jugalbandi to generate answers. Jugalbandi retrieves the most relevant chunks of information from the knowledge base using embedding-based similarity search. These chunks are then passed to a generation model (such as GPT-3.5 or GPT-4o) to produce the answers.
Comparison of Jugalbandi’s answers with the ground truth: In the evaluation of Jugalbandi’s RAG pipeline against Ragas’ ground truth, four key metrics were used to assess performance:
Answer Relevance: How relevant the generated answer is to the question.
Faithfulness: How accurately the answer reflects the information in the context (ground truth).
Context Recall: How completely the retrieved chunks cover the information in the ground truth.
Context Precision: The proportion of retrieved chunks that are actually relevant to answering the question.
Results: The metrics are analysed to determine the effectiveness of the RAG pipeline. For example, in one experiment that used a legal information knowledge base and 998 questions, Jugalbandi achieved 85.45% context precision and 92.74% context recall. These results indicate a high level of accuracy in retrieving and generating answers based on the provided knowledge base.
Experiment limitations: While the evaluation framework is a useful tool to assess the performance of Jugalbandi’s chunking and retrieval strategies, it has several limitations:
Question Variety: The synthetic dataset mostly includes straightforward questions, whereas user questions can vary significantly in complexity and phrasing, which limits how comprehensive the evaluation can be. Citizen-centric applications tend to receive more open-ended questions, where determining the user’s intent and relating it to the available knowledge base is not always straightforward.
Model Dependency: The quality of answers depends on the underlying LLM (e.g., GPT-3.5 or GPT-4). Smaller models like Phi-3 may not perform as well on complex questions, although they are more lightweight and can be deployed on devices with limited resources. Because the same LLMs are used both in Jugalbandi and in Ragas, the generated and evaluation datasets may share very similar context. In the experiments, answers generated using GPT-4 tended to score better in evaluation than those generated with GPT-3.5. While this may be attributed to advances in the LLM, it is unclear how it affects assessment by automated frameworks.
Limited to legal information: The experiments conducted with the evaluation framework were limited to legal datasets. With advances being made in context-specific RAG pipelines, the performance metrics generated when evaluating answers from a legal dataset may not apply to other contexts.
Need for manual checks by SMEs: The answers generated by Jugalbandi were compared with the ‘ground truth’ as determined by the framework, which in turn is generated using OpenAI’s embedding models. It would be preferable for the comparison to be made against ground truth determined by a subject matter expert for the given context.
Inaccurate Relevance Scores: Evaluation frameworks sometimes overestimate how closely an answer matches the full context. For example, if an answer is based on only a small part of the provided information, Ragas might still give it a high score, which can be misleading.
While this evaluation framework provides valuable insights, it is one component of what must be a much more comprehensive testing process. This highlights the need for extensive testing of the RAG pipeline by experienced professionals or individuals familiar with the knowledge base.
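For illustration, the end-to-end flow described above can be sketched in code. The sketch below assumes the ragas 0.1.x test-set and evaluation APIs together with LangChain loaders and vector stores; the corpus path, model choices, and the retrieval-and-generation helper are illustrative stand-ins, not the actual Jugalbandi implementation.

```python
# Minimal sketch of the Ragas experiment (assumes ragas 0.1.x and LangChain).
# `legal_corpus/` and the retrieval/generation helper are illustrative stand-ins,
# not the actual Jugalbandi implementation.
from datasets import Dataset
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import (answer_relevancy, faithfulness,
                           context_recall, context_precision)
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.testset.generator import TestsetGenerator

# 1. Synthetic dataset creation: simple, reasoning and multi-context questions,
#    each stored with the ground-truth context it was generated from.
documents = DirectoryLoader("legal_corpus/").load()
generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents, test_size=100,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

# 2. Answer generation: retrieve the most similar chunks and pass them to a
#    generation model (in practice the documents would first be split into chunks).
vector_store = FAISS.from_documents(documents, OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-4o", temperature=0)

def answer_question(question: str) -> tuple[str, list[str]]:
    chunks = [d.page_content for d in vector_store.similarity_search(question, k=5)]
    prompt = ("Answer the question using only the context below.\n\n"
              + "\n\n".join(chunks) + f"\n\nQuestion: {question}")
    return llm.invoke(prompt).content, chunks

# 3. Comparison with the ground truth on the four metrics.
rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for record in testset.to_pandas().itertuples():
    answer, chunks = answer_question(record.question)
    rows["question"].append(record.question)
    rows["answer"].append(answer)
    rows["contexts"].append(chunks)
    rows["ground_truth"].append(record.ground_truth)

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
)
print(result)  # e.g. the 998-question legal run reported ~0.85 context precision, ~0.93 context recall
```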
pRAGyoga - RAG Benchmarking
To address the challenges of AI evaluation, pRAGyoga aims to create a comprehensive legal benchmarking dataset in India. This dataset will be used to test RAG systems on key metrics such as recall, accuracy, and possibly the cost-effectiveness of each RAG pipeline. Recall measures the system's ability to retrieve all relevant information, while accuracy assesses the correctness of the information retrieved.
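As a toy illustration of these two retrieval metrics, the snippet below computes recall and precision for a single query, given the chunks annotators marked as relevant and the chunks a RAG system actually retrieved. The chunk identifiers and numbers are hypothetical.

```python
# Toy computation of retrieval recall and precision for one query.
def retrieval_recall(relevant: set[str], retrieved: list[str]) -> float:
    """Fraction of the relevant chunks that appear among the retrieved chunks."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def retrieval_precision(relevant: set[str], retrieved: list[str]) -> float:
    """Fraction of the retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(relevant & set(retrieved)) / len(retrieved)


# Hypothetical example: annotators marked three chunks as relevant,
# and the system returned five chunks for the query.
relevant = {"sec_498A_para2", "sec_498A_para3", "case_xyz_para7"}
retrieved = ["sec_498A_para2", "sec_304B_para1", "case_xyz_para7",
             "sec_498A_para9", "faq_12"]

print(retrieval_recall(relevant, retrieved))     # 2/3 ≈ 0.67
print(retrieval_precision(relevant, retrieved))  # 2/5 = 0.40
```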
The creation of this dataset will involve a diverse group of students from various disciplines, including law, social sciences, gender studies, data sciences, and humanities. This approach brings diverse and realistic questions that purely computational methods might overlook. Students can generate varied, real-world queries and provide detailed feedback, which enhances the system's reliability. Additionally, using a crowd-sourced method to create benchmarked data ensures that it reflects a wide range of perspectives, making it a strong standard for evaluating AI systems.
Implementation: The program has conducted pilots with a handful of students, with a scale-up across multiple institutions planned. The students go through a diverse set of legal documents, including acts, case laws, judgements, etc., and then follow these steps (a hypothetical structure for the resulting annotation records is sketched after this list):
Using open-source annotation tools, the student volunteers generate questions from the provided documents and annotate the parts of the document containing the answers to their questions.
Students input their questions into the evaluation tool, which uses retrieval methods to pull up relevant information from the documents in its knowledge base. Typically, the system retrieves five chunks of information for each query.
Students review the retrieved information to check for relevance. If the information is relevant, they highlight the relevant parts; if not, they add the necessary details in a free-text box.
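For illustration, one annotated query in the resulting benchmark might be stored along the following lines, covering the question, the annotated answer spans, the retrieved chunks with their relevance judgements, and flags for the review steps described under Quality Control below. The field names are assumptions for the sketch, not pRAGyoga’s actual schema.

```python
# Hypothetical record structure for one annotated benchmark query.
from dataclasses import dataclass, field


@dataclass
class RetrievedChunk:
    chunk_id: str
    text: str
    relevant: bool              # did the student mark this chunk as relevant?
    highlighted_span: str = ""  # portion of the chunk highlighted as relevant


@dataclass
class BenchmarkRecord:
    question: str
    source_document: str        # act, case law, judgement, etc.
    answer_spans: list[str]     # document passages annotated as containing the answer
    retrieved: list[RetrievedChunk] = field(default_factory=list)  # typically five chunks
    free_text_note: str = ""    # filled in when retrieval missed the answer
    peer_reviewed: bool = False  # checked by another student
    expert_reviewed: bool = False  # confirmed or corrected by an expert
```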
Quality Control: Since evaluating the quality of each student’s annotations would be difficult as the program scales, the following measures were undertaken:
Peer-Review: Once the annotations are complete, different students review the work of their peers. They check if the questions make sense and if the answers match the questions. They can flag or edit the answers as needed.
Expert Review: Questions and answers flagged or edited by students undergo a final review by experts, who either confirm the annotations or make further edits to ensure accuracy.
Results of the pilot: Initial pilots showed that students could create a diverse and relevant set of questions and answers, as compared to the straightforward and pointed questions generated by automated evaluation frameworks. The exercise has been valuable in highlighting areas where the bot performs well and identifying aspects that need improvement.
Limitations: While this exercise is beneficial, it is time-consuming and depends heavily on the quality of student annotations. Comparing student-generated answers with Jugalbandi’s responses is done through additional LLM evaluations, which may introduce some subjectivity when one LLM is used to evaluate another LLM’s performance. A sketch of what such an LLM-based comparison might look like follows.
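The sketch below assumes the OpenAI chat completions API; the model choice, prompt wording, and scoring scale are assumptions for illustration and not the actual evaluation prompt used in pRAGyoga.

```python
# Hypothetical LLM-as-judge comparison of a bot answer against the
# student-written reference answer (assumes the OpenAI Python SDK >= 1.0).
from openai import OpenAI

client = OpenAI()


def judge_answer(question: str, reference_answer: str, bot_answer: str) -> str:
    """Ask an LLM to grade the bot's answer against the student-annotated reference."""
    prompt = (
        "You are grading a legal question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer (written by a student annotator): {reference_answer}\n"
        f"System answer: {bot_answer}\n"
        "On a scale of 1-5, how well does the system answer cover the reference "
        "answer? Reply with the score followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable judge model could be used here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```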