Possibilities and Future Directions

The past few years have seen a significant surge in the usage of AI applications, outpacing the adoption of most other tech-based services in recent memory. Organisations of all types, across nearly every industry, are adopting AI for the new insights and creative potential it offers. The Deloitte Tech Trends 2023 report states that in 2023-24 alone, over 6.5 lakh employees in India were expected to undergo upskilling programs related to Generative AI. The same report projects that between 2023 and 2030 the GenAI market in India alone will grow at a CAGR of 24%. At its current pace of development, the global market for conversational AI is set to reach US$14bn by 2025 and ~US$32.6bn by 2030.

Despite these strides, Deloitte’s State of AI in India report, based on a survey of over 200 executives across industries and business functions, found that among the various GenAI use cases applied to automate or streamline routine operational practices, legal document review had the fewest adopters (33% of surveyed participants). This presents a considerable opportunity for entrepreneurs to build reliable AI-powered solutions for routine business operations, knowledge management, marketing, HR, IT and the many other use cases that will emerge.

Potential Technological Advancements

This section details some of the experiments that informed crucial decisions on how the Jugalbandi stack is configured. The RAG framework has been central to these experiments, given its pivotal role in increasing the reliability of the information provided by any AI solution, including Jugalbandi. The experiments were carried out using technical evaluation frameworks and with the help of legal professionals and law school volunteers.

Experimenting with RAG Evaluation Frameworks

Ragas, an open-source evaluation framework for testing the reliability of answers generated by any RAG pipeline, was used to test Jugalbandi. The process involved the following steps:

(Image credit: Arun Murugan)

  1. Synthetic dataset creation: Ragas used OpenAI LLMs to generate a varied set of questions from a predefined dataset. These questions fall into three types: simple questions, questions whose answers require reasoning, and questions whose answers require information from multiple contexts. The dataset stores the corresponding context from which each question was generated and refers to it as the ‘ground truth’.

  2. Jugalbandi answer generation: The questions from the synthetic dataset are passed to Jugalbandi to generate answers. Jugalbandi retrieves relevant chunks of information from the knowledge base using embedding-based similarity search. These chunks are then passed to a generation model (such as GPT-3.5 or GPT-4o) to produce the answers.

  3. Comparison of Jugalbandi’s answers with the ground truth: In the evaluation of Jugalbandi’s RAG pipeline against Ragas’ ground truth, four key metrics were used to assess performance (a code sketch of the full evaluation flow follows this list):

  • Answer Relevance: How relevant the generated answer is to the question.

  • Faithfulness: How accurately the answer reflects the information in the context (ground truth).

  • Context Recall: The ability to retrieve relevant chunks from the knowledge base.

  • Context Precision: The precision of the retrieved chunks in providing the correct answer.
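
The snippet below is a minimal sketch of how this three-step flow can be wired together, assuming the ragas 0.1.x API (class, metric and column names differ across versions). The corpus path is a placeholder, and `answer_with_jugalbandi` is a hypothetical stand-in for the Jugalbandi retrieval-and-generation pipeline.

```python
# Illustrative sketch of the three-step evaluation flow with Ragas.
# The API shown follows the ragas 0.1.x style and may differ in other versions.
from datasets import Dataset
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context


def answer_with_jugalbandi(question: str):
    """Hypothetical stand-in for the Jugalbandi RAG pipeline: should return the
    generated answer and the list of retrieved chunks for a question."""
    raise NotImplementedError


# 1. Synthetic dataset creation from the knowledge base documents.
documents = DirectoryLoader("legal_corpus/").load()  # placeholder corpus path
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-3.5-turbo"),
    critic_llm=ChatOpenAI(model="gpt-4"),
    embeddings=OpenAIEmbeddings(),
)
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

# 2. Jugalbandi answer generation for each synthetic question.
records = []
for row in testset.to_pandas().itertuples():
    answer, retrieved_chunks = answer_with_jugalbandi(row.question)
    records.append(
        {
            "question": row.question,
            "answer": answer,
            "contexts": retrieved_chunks,      # list of retrieved chunk texts
            "ground_truth": row.ground_truth,  # context recorded by Ragas
        }
    )

# 3. Comparison against the ground truth on the four metrics.
results = evaluate(
    Dataset.from_list(records),
    metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
)
print(results)  # per-metric scores, e.g. context_precision and context_recall
```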

Results: The metrics are analysed to determine the effectiveness of the RAG pipeline. For example, in one experiment using a legal information knowledge base and 998 questions, Jugalbandi achieved 85.45% context precision and 92.74% context recall. These results indicate a high level of accuracy in retrieving and generating answers based on the provided knowledge base.

Experiment limitations: While the evaluation framework is a useful tool for assessing the performance of Jugalbandi’s chunking and retrieval strategies, it has several limitations:

  1. Question Variety: The synthetic dataset mostly includes straightforward questions, whereas user questions can vary significantly in complexity and phrasing, which limits how comprehensive the evaluation can be. Citizen-centric applications tend to receive more open-ended questions, where determining the user’s intent and relating it to the available knowledge base is not always straightforward.

  2. Model Dependency: The quality of answers depends on the underlying LLM (e.g., GPT-3.5 or GPT-4). Smaller models like Phi-3 may not perform as well on complex questions, although they are more lightweight and can be deployed on devices with limited resources. Using the same LLMs for both Jugalbandi and Ragas also means that the generated and evaluation datasets may share very similar context. In the experiments, answers generated using GPT-4 tended to score better in evaluation than those generated with GPT-3.5. While this may be attributed to advances in the LLM, it is unclear how it affects assessment by automated frameworks.

  3. Limited to legal information: The experiments conducted with the evaluation framework were limited to legal datasets. With advances being made in context-specific RAG pipelines, the performance metrics obtained when evaluating answers from a legal dataset may not apply to other contexts.

  4. Need for manual checks by SMEs: The answers generated by Jugalbandi were compared with the ‘ground truth’ as determined by the framework, which in turn is generated using OpenAI’s embedding models. It is preferable for the comparison to be made against the ground truth as determined by a subject matter expert for the given context.

  5. Inaccurate Relevance Scores: Evaluation frameworks sometimes overestimate how closely an answer matches the full context. For example, if an answer is based on only a small part of the provided information, Ragas might still give it a high score, which can be misleading.

While this evaluation framework provides valuable insights, it is one component of what must be a much more comprehensive testing process. This highlights the need for extensive testing of the RAG pipeline by experienced professionals or individuals familiar with the knowledge base.

pRAGyoga - RAG Benchmarking

To address the challenges of AI evaluation, pRAGyoga aims to create a comprehensive legal benchmarking dataset in India. This dataset will be used to test RAG systems on key metrics such as recall, accuracy, and possibly the cost-effectiveness of each RAG pipeline. Recall measures the system's ability to retrieve all relevant information, while accuracy assesses the correctness of the information retrieved.

The creation of this dataset will involve a diverse group of students from various disciplines, including law, social sciences, gender studies, data sciences, and humanities. This approach brings diverse and realistic questions that purely computational methods might overlook. Students can generate varied, real-world queries and provide detailed feedback, which enhances the system's reliability. Additionally, using a crowd-sourced method to create benchmarked data ensures that it reflects a wide range of perspectives, making it a strong standard for evaluating AI systems.
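
As an illustration only, a single benchmark entry could be represented along the lines of the sketch below, which combines the annotation and review steps described in this section; the structure and field names are assumptions, not pRAGyoga’s actual data format.

```python
# Hypothetical schema for one pRAGyoga-style benchmark entry.
# Structure and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkEntry:
    question: str                  # student-written, real-world legal query
    source_document: str           # act, case law or judgement it was drawn from
    annotated_spans: List[str]     # passages highlighted as containing the answer
    annotator_discipline: str      # e.g. "law", "gender studies", "data sciences"
    peer_reviewed: bool = False    # set once another student has checked the entry
    expert_verified: bool = False  # set after final review by an expert
```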

Implementation: The program has conducted pilots with a handful of students, with a scale-up across multiple institutions planned. The students go through a diverse set of legal documents, including acts, case laws, judgements, etc., and then follow these steps:

  1. Using open-source annotation tools, the student volunteers generate questions from the provided documents and annotate the parts of the document containing the answers to their questions.

  2. Students input their questions into the evaluation tool, which uses retrieval methods to pull up relevant information from the documents in its knowledge base. Typically, the system retrieves five chunks of information for each query.

  3. Students review the retrieved information to check for relevance. If the information is relevant, they highlight the relevant parts; if not, they add the necessary details in a free-text box. (A sketch of how such relevance judgments can be turned into a retrieval score follows this list.)
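
As a minimal sketch, the student relevance judgments could be aggregated into a retrieval recall score over the five retrieved chunks as shown below; the record structure, identifiers and example question are assumptions made for illustration.

```python
# Minimal sketch: turning student relevance judgments into a recall score.
# The record structure and identifiers below are illustrative assumptions.
from typing import Dict, List


def mean_recall_at_k(judgments: List[Dict], k: int = 5) -> float:
    """Average, per query, of the fraction of student-annotated relevant chunks
    that appear among the top-k chunks retrieved by the system."""
    scores = []
    for record in judgments:
        retrieved_ids = {chunk["id"] for chunk in record["retrieved_chunks"][:k]}
        relevant_ids = set(record["relevant_chunk_ids"])
        if relevant_ids:
            scores.append(len(retrieved_ids & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical judgment record produced during annotation:
example = {
    "question": "Which schemes help small farmers access crop insurance?",
    "retrieved_chunks": [{"id": "scheme_doc_12_chunk_3"}, {"id": "scheme_doc_07_chunk_1"}],
    "relevant_chunk_ids": ["scheme_doc_12_chunk_3"],
}
print(mean_recall_at_k([example]))  # 1.0
```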

Quality Control: Since evaluating the quality of each student’s annotations becomes difficult as the program scales, the following measures were undertaken:

  1. Peer-Review: Once the annotations are complete, different students review the work of their peers. They check if the questions make sense and if the answers match the questions. They can flag or edit the answers as needed.

  2. Expert Review: Questions and answers flagged or edited by students undergo a final review by experts, who either confirm the annotations or make further edits to ensure accuracy.

Results of the pilot: Initial pilots showed that students could create a diverse and relevant set of questions and answers, as compared to the straightforward and pointed questions generated by automated evaluation frameworks. The exercise has been valuable in highlighting areas where the bot performs well and identifying aspects that need improvement.

Limitations: While this exercise is beneficial, it is time-consuming and depends heavily on the quality of student annotations. Comparing student-generated answers with Jugalbandi's responses is done through additional LLM evaluations, which may introduce some subjectivity, since an LLM is being used to evaluate another LLM’s performance.
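
For illustration, such a comparison can be run as an LLM-as-judge call along the lines of the sketch below; the prompt wording, scoring scale and model choice are assumptions rather than the project’s actual evaluation setup.

```python
# Illustrative LLM-as-judge comparison of a student-annotated answer and a bot answer.
# Prompt wording and scoring scale are assumptions, not the project's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_agreement(question: str, student_answer: str, bot_answer: str) -> str:
    prompt = (
        "You are comparing two answers to the same legal question.\n"
        f"Question: {question}\n"
        f"Reference answer (student-annotated): {student_answer}\n"
        f"Candidate answer (bot): {bot_answer}\n"
        "Rate the candidate's factual agreement with the reference on a 1-5 "
        "scale and briefly justify the score."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation in the judgment
    )
    return response.choices[0].message.content
```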

Jugalbandi Studio: Building with AI has never been more accessible

Technology around us is rapidly evolving, making the need for intuitive and user-friendly platforms more critical than ever. This is especially true for technology that has the potential to improve lives across sectors by enhancing access to information and enabling action. Simplifying access to such technology for developers and functional experts is necessary to reduce dependencies and roadblocks in development. This is the rationale that inspired the creation of Jugalbandi Studio.

Jugalbandi Studio is an intuitive, user-friendly application designed to simplify bot creation and implementation for users with minimal technical expertise. It features a chat-based interface where users can articulate their bot requirements in functional terms without worrying about the technical know-how.

How is Jugalbandi Studio different from Jugalbandi Manager?

Jugalbandi Studio is designed to make bot creation accessible to users with minimal technical expertise by offering an intuitive, chat-based interface and guided setup. It simplifies the process of defining bot requirements, customising functionalities, and visualising user journeys, enabling anyone to develop and deploy bots without deep technical knowledge. In contrast, Jugalbandi Manager acts as a managed service that caters to developers, providing a robust, scalable framework for building and managing sophisticated AI-powered bots. It supports multiple platforms and can integrate with any LLM or service, allowing for advanced customization and scalability suitable for larger user bases and more complex applications.

Core Capabilities of Jugalbandi Studio

It has a user-friendly interface

Jugalbandi Studio features an intuitive chat-based interface where users can describe their bot requirements in natural language. This approach removes the barrier of needing technical knowledge, allowing anyone to create a Jugalbandi service tailored to their needs. For example, users can simply type, "I need a bot to help farmers find government schemes," and the Studio will guide them through the setup.

It allows customization and lets you visualise interaction paths/user journeys

Jugalbandi Studio includes a tool for developing flow diagrams. These diagrams provide a visual representation of interaction paths, making it easier to structure and customise the chatbot’s functionalities and user journeys. For instance, users can visualise how a farmer might interact with the bot to find relevant schemes and follow-up actions.

Real-time Testing and Debugging

Users can test and debug their applications in real-time, ensuring smooth functionality before deployment. This feature allows users to simulate interactions and make necessary adjustments on the fly before the bot goes live.

It also contains a template library

The platform offers pre-built templates for common use cases, speeding up the chatbot development process. For example, templates for customer service bots, educational bots, and informational bots are available for quick customization.

It can guide you with best practices

Jugalbandi Studio provides initial guidance on best practices, helping users create effective and efficient bots from the start. This feature is particularly useful for users new to chatbot development, offering tips and examples to ensure optimal performance.

Deployment and Monitoring

Studio facilitates easy deployment of chatbots and provides tools for monitoring their performance and usage. Citizen developers and SMEs building bots using Studio can track interactions and gather analytics to improve their bots over time.