{"id":15857,"date":"2023-11-27T02:00:07","date_gmt":"2023-11-27T10:00:07","guid":{"rendered":"https:\/\/softwareengineeringdaily.com\/?p=15857"},"modified":"2023-11-20T18:04:32","modified_gmt":"2023-11-21T02:04:32","slug":"building-a-privacy-preserving-llm-based-chatbot","status":"publish","type":"post","link":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/","title":{"rendered":"Building a Privacy-Preserving LLM-Based Chatbot"},"content":{"rendered":"<p id=\"9a53\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">As Large Language Models (LLMs) and generative AI continue to grow more sophisticated and available, many organizations are starting to build, fine-tune, and customize LLMs based on their internal data and documents. This can bring incredible efficiency and reliability to data-driven decision-making processes. However, this practice comes with its share of challenges, primarily around data privacy, protection, and governance.<\/p>\n<p id=\"9315\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Let\u2019s consider the construction of the LLM itself, which is trained on a massive amount of data collected from public and private sources. 
Without careful anonymization and filtering, sensitive data \u2014 such as PII or intellectual property \u2014 may be inadvertently included in the training set, potentially leading to a privacy breach.<\/p>\n<p id=\"e451\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Furthermore, privacy concerns are introduced when interacting with LLMs, as users might input sensitive data, such as names, addresses, or even confidential business information. If these inputs aren\u2019t handled properly, the misuse or exposure of this information is a genuine risk.<\/p>\n<p id=\"9bca\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">In this post, we\u2019ll explore how to work with LLMs in a privacy-preserving way when building an LLM-based chatbot. As we walk through the technology from end-to-end, we\u2019ll highlight the most acute data privacy concerns and we\u2019ll show how using a data privacy vault addresses those concerns.<\/p>\n<p id=\"8e99\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Let\u2019s start by taking a closer look at the problem we need to solve.<\/p>\n<h1 id=\"d673\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">The problem: Protecting sensitive information from exposure by a chatbot<\/h1>\n<p id=\"adea\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Consider a company that uses an LLM-based chatbot for its internal operations. 
The LLM for the chatbot was built by modifying a pre-existing base model with embeddings created from internal company documents. The chatbot provides an easy-to-use interface that lets non-technical users within the company access information from internal data and documents.<\/p>\n<p id=\"2555\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">The company has a sensitive internal project called \u201cProject Titan.\u201d Project Titan is so important and so sensitive that only people working on Project Titan know about it. In fact, the team often says: the first rule of Project Titan is don\u2019t talk about Project Titan. Naturally, the team wants to take advantage of the internal chatbot and also include Project Titan specific information to speed up creation of design documents, documentation, and press releases. However, they need to control who can see details about this sensitive project.<\/p>\n<p id=\"93e8\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">What we have is a tangible and pressing privacy concern that sits at the intersection of AI and data. These challenges appear extremely difficult to solve in a scalable and production-ready way. 
Simply having a private version of the LLM doesn\u2019t address the core issue of data access.<\/p>\n<h1 id=\"c1e8\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">The proposed solution: Sensitive data de-identification and fine-grained access control<\/h1>\n<p id=\"7990\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Ultimately, we need to identify the key points where sensitive data must be de-identified during the process of building (or fine-tuning) the LLM and the end user\u2019s interaction with the LLM-based chatbot. After careful analysis, we\u2019ve identified that there are two key points in the process where we need to de-identify (and later re-identify) sensitive data:<\/p>\n<ol class=\"\" style=\"text-align: justify;\">\n<li id=\"fb56\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nn no np bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">Before ingestion<\/strong>: When documents from Project Titan are used to create embeddings, the project name, any PII, and anything else sensitive to the project must be de-identified. 
This de-identification should occur as part of the ETL pipeline prior to data ingestion into the LLM.<\/li>\n<li id=\"fd5c\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nn no np bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">During use<\/strong>: When a user inputs data to the chatbot, any sensitive data included in that input must also be de-identified.<\/li>\n<\/ol>\n<p id=\"38e3\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">You can de-identify sensitive data using Skyflow\u2019s polymorphic encryption and tokenization engine that\u2019s included within Skyflow Data Privacy Vault. This covers detection of PII as well as terms you define within a sensitive data dictionary, such as intellectual property (e.g., Project Titan).<\/p>\n<p id=\"6bc4\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Of course, only Project Titan team members who use the chatbot should be able to access the sensitive project data. Therefore, when the chatbot forms a response, we\u2019ll rely on Skyflow\u2019s governance engine (which provides fine-grained access control) and detokenization API to retrieve the sensitive data from the data privacy vault, making it available only to authorized end users.<\/p>\n<p id=\"3af7\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Before we dive into the technical implementation, let\u2019s go through a brief overview of foundational LLM concepts. 
If you\u2019re already familiar with these concepts, you can skip the next section.<\/p>\n<h1 id=\"d670\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">A brief primer on LLMs<\/h1>\n<p id=\"bd22\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">LLMs are sophisticated artificial intelligence (AI) systems designed to analyze, generate, and work with human language. Built on advanced machine learning architectures, they are trained on vast quantities of text data, enabling them to generate text that is convincingly human-like in its coherence and relevance.<\/p>\n<p id=\"c456\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">LLMs leverage a technology called\u00a0<em class=\"nv\">transformers\u00a0<\/em>\u2014 one example is GPT, which stands for Generative Pre-Trained Transformer \u2014 to predict or generate a piece of text when given input or context. LLMs learn from patterns in the data they are trained on and then apply these learnings to understand newly given content or to generate new content.<\/p>\n<p id=\"74f6\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Despite their benefits, LLMs pose potential challenges in terms of privacy, data security, and ethical considerations. This is because LLMs can inadvertently memorize sensitive information from their training data or generate inappropriate content if not properly regulated or supervised. 
Therefore, the use of LLMs necessitates effective strategies for data handling, governance, and preserving user privacy.<\/p>\n<h1 id=\"5430\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">A technical overview of the solution<\/h1>\n<p id=\"84bd\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">When embarking on any LLM project, we need to start with a model. Many open-source LLMs have been released in recent months, each with its specific area of focus. Instead of building an entire LLM model from scratch, many developers choose a pre-built model and then adjust the model with vector embeddings generated from domain-specific data.<\/p>\n<p id=\"e98e\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Vector embeddings encapsulate the semantic relationship between words and help algorithms understand context. 
The embeddings act as an additional contextual knowledge base to help augment the facts known by the base model.<\/p>\n<p id=\"008d\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">In our case, we\u2019ll begin with an\u00a0<a class=\"af nw\" href=\"https:\/\/huggingface.co\/models?other=LLM\" target=\"_blank\" rel=\"noopener ugc nofollow\">existing model from Hugging Face<\/a>, and then customize it with embeddings.\u00a0<a class=\"af nw\" href=\"https:\/\/huggingface.co\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Hugging Face<\/a>\u00a0provides ML infrastructure services as well as open-source models and datasets.<\/p>\n<p id=\"d15e\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">In addition to the Hugging Face model, we\u2019ll use the following tools to build out our privacy-preserving LLM-based ETL pipeline and chatbot:<\/p>\n<ul class=\"\" style=\"text-align: justify;\">\n<li id=\"335e\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\"><a class=\"af nw\" href=\"https:\/\/python.langchain.com\/docs\/get_started\/introduction.html\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"ll ew\">LangChain<\/strong><\/a>, an open-source Python library that chains together components typically used for building applications (such as chatbots) powered by LLMs<\/li>\n<li id=\"7ab7\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\"><a class=\"af nw\" href=\"https:\/\/www.snowflake.com\/en\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"ll ew\">Snowflake<\/strong><\/a>, which we\u2019ll use for 
internal document and data storage<\/li>\n<li id=\"0237\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\"><a class=\"af nw\" href=\"https:\/\/docs.snowflake.com\/en\/user-guide\/data-load-snowpipe-intro\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"ll ew\">Snowpipe<\/strong><\/a>, which we\u2019ll use with Snowflake for automated data loading<\/li>\n<li id=\"cecd\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\"><a class=\"af nw\" href=\"https:\/\/www.trychroma.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"ll ew\">Chroma<\/strong><\/a>, an AI-native, open-source database for vector embeddings<\/li>\n<li id=\"450d\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\"><a class=\"af nw\" href=\"https:\/\/streamlit.io\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"ll ew\">Streamlit<\/strong><\/a>, an open-source framework for building AI\/ML-related applications using Python<\/li>\n<li id=\"8efe\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\"><a class=\"af nw\" href=\"https:\/\/python.langchain.com\/docs\/use_cases\/question_answering\/how_to\/vector_db_qa\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"ll ew\">RetrievalQA<\/strong><\/a>, a question-answering chain in LangChain which gets documents from a Retriever and then uses a QA chain to answer questions from those documents<\/li>\n<\/ul>\n<p id=\"81e9\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">The following diagram shows the high-level ETL and embeddings data 
flow:<\/p>\n<figure class=\"ob oc od oe of og ny nz paragraph-image\">\n<div class=\"oh oi go oj bg ok\" tabindex=\"0\" role=\"button\">\n<div class=\"ny nz oa\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*ZabJ_zyZcCV6BGrb 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*ZabJ_zyZcCV6BGrb 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*ZabJ_zyZcCV6BGrb 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*ZabJ_zyZcCV6BGrb 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*ZabJ_zyZcCV6BGrb 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*ZabJ_zyZcCV6BGrb 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*ZabJ_zyZcCV6BGrb 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" \/><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*ZabJ_zyZcCV6BGrb 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*ZabJ_zyZcCV6BGrb 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*ZabJ_zyZcCV6BGrb 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*ZabJ_zyZcCV6BGrb 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*ZabJ_zyZcCV6BGrb 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*ZabJ_zyZcCV6BGrb 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*ZabJ_zyZcCV6BGrb 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 
700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\" \/><img fetchpriority=\"high\" decoding=\"async\" class=\"bg kq ol c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*ZabJ_zyZcCV6BGrb\" alt=\"\" width=\"700\" height=\"210\" \/><\/picture><\/div>\n<\/div><figcaption class=\"om on oo ny nz op oq be b bf z hb\" data-selectable-paragraph=\"\">Example of the ETL and embeddings data flow.<\/figcaption><\/figure>\n<p id=\"f500\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">The ETL and embeddings flows from end to end are:<\/p>\n<p id=\"2f12\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">ETL<\/strong><\/p>\n<ul class=\"\">\n<li id=\"45b6\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\">Start with source data, which may contain sensitive data.<\/li>\n<li id=\"10f2\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Send data to Skyflow Data Privacy Vault for de-identification.<\/li>\n<li id=\"22ef\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Use Snowpipe to load clean data into Snowflake.<\/li>\n<\/ul>\n<p id=\"a645\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\"><strong class=\"ll 
ew\">Create vector embeddings<\/strong><\/p>\n<ul class=\"\">\n<li id=\"965d\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\">Load documents from Snowflake into LangChain.<\/li>\n<li id=\"b587\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Create vector embeddings with LangChain.<\/li>\n<li id=\"0a49\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Store embeddings in Chroma.<\/li>\n<\/ul>\n<p id=\"8458\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">Once the model has been customized with the Project Titan information, the user interaction and inference flow is as follows:<\/p>\n<figure class=\"ob oc od oe of og ny nz paragraph-image\">\n<div class=\"oh oi go oj bg ok\" tabindex=\"0\" role=\"button\">\n<div class=\"ny nz oa\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*lLHrQlr4o3MhTrDT 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*lLHrQlr4o3MhTrDT 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*lLHrQlr4o3MhTrDT 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*lLHrQlr4o3MhTrDT 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*lLHrQlr4o3MhTrDT 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*lLHrQlr4o3MhTrDT 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*lLHrQlr4o3MhTrDT 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 
2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" \/><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*lLHrQlr4o3MhTrDT 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*lLHrQlr4o3MhTrDT 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*lLHrQlr4o3MhTrDT 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*lLHrQlr4o3MhTrDT 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*lLHrQlr4o3MhTrDT 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*lLHrQlr4o3MhTrDT 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*lLHrQlr4o3MhTrDT 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\" \/><img decoding=\"async\" class=\"bg kq ol c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*lLHrQlr4o3MhTrDT\" alt=\"\" width=\"700\" height=\"448\" \/><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"4032\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">User interaction and inference information flow<\/p>\n<ol class=\"\">\n<li id=\"7f8e\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nn no np bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">Chat UI input<\/strong><\/li>\n<\/ol>\n<ul class=\"\">\n<li id=\"746f\" class=\"lj lk ev ll b 
lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\">Accept user input via Streamlit\u2019s chat UI.<\/li>\n<li id=\"7e71\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Send user input to Skyflow for de-identification.<\/li>\n<\/ul>\n<p id=\"94a6\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">2. Retrieve embeddings<\/strong><\/p>\n<ul class=\"\">\n<li id=\"eadc\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\">Get the embeddings from Chroma and attach to RetrievalQA.<\/li>\n<\/ul>\n<p id=\"d23e\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">3. Inference<\/strong><\/p>\n<ul class=\"\">\n<li id=\"3df7\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\">Send clean data to RetrievalQA.<\/li>\n<li id=\"c87a\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Use QA chain in RetrievalQA to answer the user\u2019s question.<\/li>\n<\/ul>\n<p id=\"6378\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\"><strong class=\"ll ew\">4. 
Chat UI response<\/strong><\/p>\n<ul class=\"\">\n<li id=\"f5cb\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nx no np bj\" data-selectable-paragraph=\"\">Send RetrievalQA\u2019s response to Skyflow for detokenization.<\/li>\n<li id=\"958b\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nx no np bj\" data-selectable-paragraph=\"\">Send re-identified data to Streamlit for display to the end user.<\/li>\n<\/ul>\n<p id=\"858a\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">Now that we\u2019re clear on the high-level process, let\u2019s dive in and take a closer look at each step.<\/p>\n<h1 id=\"6fce\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">ETL: Cleaning the source data<\/h1>\n<p id=\"db93\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Cleaning the source data with Skyflow Data Privacy Vault is fairly straightforward and I\u2019ve covered some of this in a\u00a0<a class=\"af nw\" href=\"https:\/\/medium.com\/snowflake\/keeping-sensitive-customer-data-out-of-snowflake-with-skyflow-and-snowpipe-2a65320b66f\" rel=\"noopener\">prior post<\/a>. 
In this case, we need to process all the source documents for Project Titan available in an AWS S3 bucket.<\/p>\n<p id=\"316b\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Skyflow will store the raw files, de-identify PII and IP, and save the clean files to another S3 bucket.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"ff1a\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">import<\/span> boto3\r\n<span class=\"hljs-keyword\">from<\/span> skyflow.errors <span class=\"hljs-keyword\">import<\/span> SkyflowError\r\n<span class=\"hljs-keyword\">from<\/span> skyflow.service_account <span class=\"hljs-keyword\">import<\/span> generate_bearer_token, is_expired\r\n<span class=\"hljs-keyword\">from<\/span> skyflow.vault <span class=\"hljs-keyword\">import<\/span> Client, ConnectionConfig, Configuration, RequestMethod\r\n\r\n<span class=\"hljs-comment\"># Authentication to Skyflow API<\/span>\r\nbearerToken = <span class=\"hljs-string\">''<\/span>\r\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">tokenProvider<\/span>():\r\n    <span class=\"hljs-keyword\">global<\/span> bearerToken\r\n    <span class=\"hljs-comment\"># Reuse the cached token until it expires<\/span>\r\n    <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-keyword\">not<\/span> is_expired(bearerToken):\r\n        <span class=\"hljs-keyword\">return<\/span> bearerToken\r\n    bearerToken, _ = generate_bearer_token(<span class=\"hljs-string\">'&lt;YOUR_CREDENTIALS_FILE_PATH&gt;'<\/span>)\r\n    <span class=\"hljs-keyword\">return<\/span> bearerToken\r\n\r\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">processTrainingData<\/span>(<span class=\"hljs-params\">trainingData<\/span>):\r\n    <span class=\"hljs-keyword\">try<\/span>:\r\n        <span class=\"hljs-comment\"># Vault connection configuration<\/span>\r\n        config = Configuration(<span class=\"hljs-string\">'&lt;YOUR_VAULT_ID&gt;'<\/span>, <span class=\"hljs-string\">'&lt;YOUR_VAULT_URL&gt;'<\/span>, tokenProvider)\r\n\r\n        <span class=\"hljs-comment\"># Define the connection API endpoint<\/span>\r\n        connectionConfig = ConnectionConfig(<span 
class=\"hljs-string\">'&lt;YOUR_CONNECTION_URL&gt;'<\/span>, RequestMethod.POST,\r\n        requestHeader = {\r\n            <span class=\"hljs-string\">'Content-Type'<\/span>: <span class=\"hljs-string\">'application\/json'<\/span>,\r\n            <span class=\"hljs-string\">'Authorization'<\/span>: <span class=\"hljs-string\">'&lt;YOUR_CONNECTION_BASIC_AUTH&gt;'<\/span>\r\n        },\r\n        requestBody = {\r\n            <span class=\"hljs-string\">'trainingData'<\/span>: trainingData\r\n        })\r\n \r\n        <span class=\"hljs-comment\"># Connect to the vault<\/span>\r\n        client = Client(config)\r\n    \r\n        <span class=\"hljs-comment\"># Call the Skyflow API to de-identify the training data<\/span>\r\n        response = client.invoke_connection(connectionConfig)\r\n\r\n        <span class=\"hljs-comment\"># Define the S3 bucket name and key for the file<\/span>\r\n        bucketName = <span class=\"hljs-string\">\"clean-data-bucket\"<\/span>\r\n        fileKey = <span class=\"hljs-string\">\"{timestamp}-{generated-uuid}\"<\/span>\r\n\r\n        <span class=\"hljs-comment\"># Write the data to a file in memory<\/span>\r\n        fileContents = <span class=\"hljs-built_in\">bytes<\/span>(response.training_data.encode(<span class=\"hljs-string\">\"UTF-8\"<\/span>))\r\n\r\n        <span class=\"hljs-comment\"># Upload the file to S3<\/span>\r\n        s3 = boto3.client(<span class=\"hljs-string\">\"s3\"<\/span>)\r\n        s3.put_object(Bucket=bucketName, Key=fileKey, Body=fileContents)\r\n    <span class=\"hljs-keyword\">except<\/span> SkyflowError <span class=\"hljs-keyword\">as<\/span> e:\r\n        <span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">'Error Occurred:'<\/span>, e)<\/span><\/pre>\n<p id=\"bd16\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Next, we\u2019ll configure 
Snowpipe to detect new documents in our S3 bucket and load that data into Snowflake. To do this, we\u2019ll need to create the following in Snowflake:<\/p>\n<ol class=\"\">\n<li id=\"5417\" class=\"lj lk ev ll b lm ln lo lp lq lr ls lt lu nk lw lx ly nl ma mb mc nm me mf mg nn no np bj\" data-selectable-paragraph=\"\">A\u00a0<a class=\"af nw\" href=\"https:\/\/docs.snowflake.com\/en\/sql-reference\/sql\/create-table\" target=\"_blank\" rel=\"noopener ugc nofollow\">new table<\/a><\/li>\n<li id=\"7465\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nn no np bj\" data-selectable-paragraph=\"\">A\u00a0<a class=\"af nw\" href=\"https:\/\/docs.snowflake.com\/en\/sql-reference\/sql\/create-file-format\" target=\"_blank\" rel=\"noopener ugc nofollow\">file format<\/a><\/li>\n<li id=\"54ec\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nn no np bj\" data-selectable-paragraph=\"\">A\u00a0<a class=\"af nw\" href=\"https:\/\/docs.snowflake.com\/en\/sql-reference\/sql\/create-stage\" target=\"_blank\" rel=\"noopener ugc nofollow\">stage<\/a><\/li>\n<li id=\"fa3c\" class=\"lj lk ev ll b lm nq lo lp lq nr ls lt lu ns lw lx ly nt ma mb mc nu me mf mg nn no np bj\" data-selectable-paragraph=\"\">A\u00a0<a class=\"af nw\" href=\"https:\/\/docs.snowflake.com\/en\/sql-reference\/sql\/create-pipe\" target=\"_blank\" rel=\"noopener ugc nofollow\">new pipe<\/a><\/li>\n<\/ol>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"a1a4\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">CREATE<\/span> <span class=\"hljs-keyword\">OR<\/span> REPLACE <span class=\"hljs-keyword\">TABLE<\/span> custom_training_data (\r\n  training_text <span class=\"hljs-type\">BINARY<\/span>\r\n  );\r\n\r\n<span class=\"hljs-keyword\">CREATE<\/span> <span class=\"hljs-keyword\">OR<\/span> REPLACE FILE FORMAT training_data_json_format\r\n  TYPE <span 
class=\"hljs-operator\">=<\/span> JSON;\r\n\r\n<span class=\"hljs-keyword\">CREATE<\/span> <span class=\"hljs-keyword\">OR<\/span> REPLACE TEMPORARY STAGE training_data_stage\r\n  FILE_FORMAT <span class=\"hljs-operator\">=<\/span> training_data_json_format;\r\n\r\n<span class=\"hljs-keyword\">CREATE<\/span> PIPE custom_training_data\r\n  AUTO_INGEST <span class=\"hljs-operator\">=<\/span> <span class=\"hljs-literal\">TRUE<\/span>\r\n  <span class=\"hljs-keyword\">AS<\/span>\r\n  <span class=\"hljs-keyword\">COPY<\/span> <span class=\"hljs-keyword\">INTO<\/span> custom_training_data\r\n    <span class=\"hljs-keyword\">FROM<\/span> (<span class=\"hljs-keyword\">SELECT<\/span> $<span class=\"hljs-number\">1<\/span>:records.fields.training_text\r\n          <span class=\"hljs-keyword\">FROM<\/span> @training_data_stage t)\r\n    ON_ERROR <span class=\"hljs-operator\">=<\/span> <span class=\"hljs-string\">'continue'<\/span>;<\/span><\/pre>\n<p id=\"87b1\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">With that, our raw data goes through a de-identification process, and we store only the de-identified data in Snowflake. 
Any sensitive data related to Project Titan is now obscured in the LLM, but because of Skyflow\u2019s polymorphic encryption and tokenization, the de-identified data has referential integrity, meaning we can return the data to its original form when interacting with the chatbot.<\/p>\n<h1 id=\"c27f\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">Creating vector embeddings: Customizing our LLM<\/h1>\n<p id=\"71ff\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Now that we have our de-identified text data stored in Snowflake, we\u2019re confident that all information related to Project Titan has been properly concealed. The next step is to create embeddings of these documents.<\/p>\n<p id=\"e243\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">We\u2019ll use the\u00a0<a class=\"af nw\" href=\"https:\/\/huggingface.co\/hkunlp\/instructor-large\" target=\"_blank\" rel=\"noopener ugc nofollow\">Instructor<\/a>\u00a0model provided by Hugging Face as our embedding model. We store our embeddings in Chroma, a vector database built expressly for this purpose. 
This enables downstream retrieval and search over the textual data stored in our vector database.<\/p>\n<p id=\"7ccc\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">The code below imports the chat model and sets up the embedding model and the vector store.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"a8e4\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> langchain.chat_models <span class=\"hljs-keyword\">import<\/span> ChatOpenAI\r\n<span class=\"hljs-keyword\">from<\/span> langchain.embeddings <span class=\"hljs-keyword\">import<\/span> HuggingFaceEmbeddings\r\n<span class=\"hljs-keyword\">from<\/span> langchain.vectorstores <span class=\"hljs-keyword\">import<\/span> Chroma\r\n\r\nmodel_id = <span class=\"hljs-string\">\"hkunlp\/instructor-large\"<\/span>\r\nembed_model = HuggingFaceEmbeddings(model_name=model_id)\r\nvector_store = Chroma(<span class=\"hljs-string\">\"langchain_store\"<\/span>, embed_model)<\/span><\/pre>\n<p id=\"1ca6\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">Next, we need to load all documents and add them to the vector store. 
For this, we use the\u00a0<a class=\"af nw\" href=\"https:\/\/python.langchain.com\/docs\/integrations\/document_loaders\/snowflake\" target=\"_blank\" rel=\"noopener ugc nofollow\">Snowflake document loader<\/a>\u00a0in LangChain.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"0648\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> langchain.document_loaders <span class=\"hljs-keyword\">import<\/span> SnowflakeLoader\r\n<span class=\"hljs-keyword\">import<\/span> settings <span class=\"hljs-keyword\">as<\/span> s\r\n\r\nQUERY = <span class=\"hljs-string\">\"select training_text as source from custom_training_data\"<\/span>\r\nsnowflake_loader = SnowflakeLoader(\r\n    query=QUERY,\r\n    user=s.SNOWFLAKE_USER,\r\n    password=s.SNOWFLAKE_PASS,\r\n    account=s.SNOWFLAKE_ACCOUNT,\r\n    warehouse=s.SNOWFLAKE_WAREHOUSE,\r\n    role=s.SNOWFLAKE_ROLE,\r\n    database=s.SNOWFLAKE_DATABASE,\r\n    schema=s.SNOWFLAKE_SCHEMA,\r\n    metadata_columns=[<span class=\"hljs-string\">\"source\"<\/span>],\r\n)\r\ntraining_documents = snowflake_loader.load()\r\n\r\nvector_store.add_documents(training_documents)<\/span><\/pre>\n<p id=\"76c6\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">With the training documents loaded and the vector store populated, we can create the question-answering chain.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"2b11\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">from<\/span> langchain.chains <span class=\"hljs-keyword\">import<\/span> RetrievalQA\r\n\r\nqa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=<span class=\"hljs-number\">0.2<\/span>,model_name=<span class=\"hljs-string\">'gpt-3.5-turbo'<\/span>),\r\n                                 chain_type=<span class=\"hljs-string\">\"stuff\"<\/span>, \r\n                                 retriever=vector_store.as_retriever())\r\nresult = qa.run(<span class=\"hljs-string\">\"What is 
Project Titan?\"<\/span>)<\/span><\/pre>\n<p id=\"e7c6\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">This question (\u201cWhat is Project Titan?\u201d) will fail because the model doesn\u2019t actually know about Project Titan; it only knows about a de-identified version of the string \u201cProject Titan\u201d.<\/p>\n<p id=\"ea1a\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">To issue a query like this, we first need to send the query through Skyflow to de-identify the string, and then pass the de-identified version to the model. We\u2019ll tackle this next as we start to put the pieces together for our chat UI.<\/p>\n<h1 id=\"6871\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">Chat UI Input: Preserving privacy of user-supplied data<\/h1>\n<p id=\"cacc\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">We\u2019re ready to focus on the chatbot UI: accepting and processing user input, and returning results with Project Titan data detokenized when needed.<\/p>\n<p id=\"608d\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">For this portion of the project, we will use Streamlit for our UI. 
The code below creates a simple chatbot UI with Streamlit.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"6e76\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">import<\/span> openai\r\n<span class=\"hljs-keyword\">import<\/span> streamlit <span class=\"hljs-keyword\">as<\/span> st\r\n\r\nst.title(<span class=\"hljs-string\">\"\ud83d\udd0fAcme Corp Assistant\"<\/span>)\r\n\r\n<span class=\"hljs-comment\"># Initialize the chat messages history<\/span>\r\n<span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-string\">\"messages\"<\/span> <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-keyword\">in<\/span> st.session_state.keys():\r\n    st.session_state.messages = [\r\n        {<span class=\"hljs-string\">\"role\"<\/span>: <span class=\"hljs-string\">\"assistant\"<\/span>, <span class=\"hljs-string\">\"content\"<\/span>: <span class=\"hljs-string\">\"Hello \ud83d\udc4b!  \\nHow can I help?\"<\/span>}\r\n    ]\r\n\r\n<span class=\"hljs-comment\"># Prompt for user input and save<\/span>\r\n<span class=\"hljs-keyword\">if<\/span> prompt := st.chat_input():\r\n    st.session_state.messages.append({<span class=\"hljs-string\">\"role\"<\/span>: <span class=\"hljs-string\">\"user\"<\/span>, <span class=\"hljs-string\">\"content\"<\/span>: prompt})\r\n\r\n<span class=\"hljs-comment\"># display the prior chat messages<\/span>\r\n<span class=\"hljs-keyword\">for<\/span> message <span class=\"hljs-keyword\">in<\/span> st.session_state.messages:\r\n    <span class=\"hljs-keyword\">with<\/span> st.chat_message(message[<span class=\"hljs-string\">\"role\"<\/span>]):\r\n        st.write(message[<span class=\"hljs-string\">\"content\"<\/span>])\r\n\r\n<span class=\"hljs-comment\"># If last message is not from assistant, we need to generate a new response<\/span>\r\n<span class=\"hljs-keyword\">if<\/span> st.session_state.messages[-<span class=\"hljs-number\">1<\/span>][<span 
class=\"hljs-string\">\"role\"<\/span>] != <span class=\"hljs-string\">\"assistant\"<\/span>:\r\n    <span class=\"hljs-comment\"># Generate a response<\/span>\r\n    <span class=\"hljs-keyword\">with<\/span> st.chat_message(<span class=\"hljs-string\">\"assistant\"<\/span>):\r\n        <span class=\"hljs-keyword\">with<\/span> st.spinner(<span class=\"hljs-string\">\"Thinking...\"<\/span>):\r\n            response = <span class=\"hljs-string\">\"TODO\"<\/span>\r\n\r\n    message = {<span class=\"hljs-string\">\"role\"<\/span>: <span class=\"hljs-string\">\"assistant\"<\/span>, <span class=\"hljs-string\">\"content\"<\/span>: response}\r\n    st.session_state.messages.append(message)<\/span><\/pre>\n<p id=\"937a\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">Our simple chat UI looks like this:<\/p>\n<figure class=\"ob oc od oe of og ny nz paragraph-image\">\n<div class=\"oh oi go oj bg ok\" tabindex=\"0\" role=\"button\">\n<div class=\"ny nz pa\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*DeXTsocIYau_OUFX 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*DeXTsocIYau_OUFX 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*DeXTsocIYau_OUFX 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*DeXTsocIYau_OUFX 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*DeXTsocIYau_OUFX 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*DeXTsocIYau_OUFX 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*DeXTsocIYau_OUFX 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 
700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" \/><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*DeXTsocIYau_OUFX 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*DeXTsocIYau_OUFX 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*DeXTsocIYau_OUFX 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*DeXTsocIYau_OUFX 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*DeXTsocIYau_OUFX 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*DeXTsocIYau_OUFX 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*DeXTsocIYau_OUFX 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\" \/><img decoding=\"async\" class=\"bg kq ol c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*DeXTsocIYau_OUFX\" alt=\"\" width=\"700\" height=\"880\" \/><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"940e\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">As you can see, the UI accepts a user input, but doesn\u2019t currently integrate with our LLM. Next, we need to send the user input to Skyflow for de-identification before we use RetrievalQA to answer the user\u2019s question. 
Let\u2019s start with accepting and processing our input data.<\/p>\n<p id=\"64f0\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">To detect and de-identify plaintext sensitive data with Skyflow, we can use the detect API endpoint with code similar to the following:<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"fdb1\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">deIdentifyText<\/span>(<span class=\"hljs-params\"><span class=\"hljs-built_in\">input<\/span><\/span>):\r\n    data = {\r\n        <span class=\"hljs-string\">\"text\"<\/span>: [\r\n            {\r\n                <span class=\"hljs-string\">\"message\"<\/span>: <span class=\"hljs-built_in\">input<\/span>\r\n            }\r\n        ],\r\n        <span class=\"hljs-string\">\"deidentify_option\"<\/span>: <span class=\"hljs-string\">\"tokenize\"<\/span>\r\n    }\r\n    response = client.detect(data)\r\n\r\n    <span class=\"hljs-keyword\">return<\/span> response[<span class=\"hljs-number\">0<\/span>].processed_text<\/span><\/pre>\n<p id=\"7e4a\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">Now that we\u2019ve de-identified the user input data, we can send the question to RetrievalQA, which will then use a QA chain to answer the question from our documents.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"8835\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">performCompletion<\/span>(<span class=\"hljs-params\"><span class=\"hljs-built_in\">input<\/span><\/span>):\r\n     clean_input = deIdentifyText(<span 
class=\"hljs-built_in\">input<\/span>)\r\n\r\n     qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=<span class=\"hljs-number\">0.2<\/span>,model_name=<span class=\"hljs-string\">'gpt-3.5-turbo'<\/span>),\r\n                                 chain_type=<span class=\"hljs-string\">\"stuff\"<\/span>, \r\n                                 retriever=vector_store.as_retriever())\r\n     <span class=\"hljs-keyword\">return<\/span> qa.run(clean_input)<\/span><\/pre>\n<p id=\"f4de\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">We now have our response from RetrievalQA. However, we need to take one additional step before we can send it back to our user: detokenize (re-identify) our response through Skyflow\u2019s detokenization API. This is fairly straightforward, similar to previous API calls to Skyflow.<\/p>\n<p id=\"4f8f\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Everything we need is encapsulated by the function performInference, which calls reIdentifyText on the completion before returning it.<\/p>\n<p id=\"c358\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Who can see what and in which format is controlled by Skyflow\u2019s governance engine. 
There\u2019s too much to cover here, but if you want to learn more, see\u00a0<a class=\"af nw\" href=\"https:\/\/www.skyflow.com\/post\/introducing-the-skyflow-data-governance-engine\" target=\"_blank\" rel=\"noopener ugc nofollow\">Introducing the Skyflow Data Governance Engine<\/a>.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"a6a1\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title.function\">performInference<\/span>(<span class=\"hljs-params\"><span class=\"hljs-built_in\">input<\/span><\/span>):\r\n    response = performCompletion(<span class=\"hljs-built_in\">input<\/span>)\r\n\r\n    <span class=\"hljs-keyword\">return<\/span> reIdentifyText(response)<\/span><\/pre>\n<p id=\"9728\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">These final steps connect our entire application from end-to-end. 
Now, we need to update our UI code from above so that the response is correctly set.<\/p>\n<pre class=\"ob oc od oe of or os ot bo ou ba bj\"><span id=\"76cd\" class=\"ov mi ev os b bf ow ox l oy oz\" data-selectable-paragraph=\"\"><span class=\"hljs-comment\"># If last message is not from assistant, we need to generate a new response<\/span>\r\n<span class=\"hljs-keyword\">if<\/span> st.session_state.messages[-<span class=\"hljs-number\">1<\/span>][<span class=\"hljs-string\">\"role\"<\/span>] != <span class=\"hljs-string\">\"assistant\"<\/span>:\r\n    <span class=\"hljs-comment\"># Generate a response<\/span>\r\n    <span class=\"hljs-keyword\">with<\/span> st.chat_message(<span class=\"hljs-string\">\"assistant\"<\/span>):\r\n        <span class=\"hljs-keyword\">with<\/span> st.spinner(<span class=\"hljs-string\">\"Thinking...\"<\/span>):\r\n            response = performInference(st.session_state.messages[-<span class=\"hljs-number\">1<\/span>][<span class=\"hljs-string\">\"content\"<\/span>])<\/span><\/pre>\n<p id=\"46bf\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" data-selectable-paragraph=\"\">With these pieces in place, here\u2019s a quick demo of our privacy-preserving LLM-based chatbot in action:<\/p>\n<figure class=\"ob oc od oe of og ny nz paragraph-image\">\n<div class=\"ny nz pb\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1168\/format:webp\/1*x3ng5rwENj-25M35m8dthA.gif 1168w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 584px\" \/><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*x3ng5rwENj-25M35m8dthA.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*x3ng5rwENj-25M35m8dthA.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*x3ng5rwENj-25M35m8dthA.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*x3ng5rwENj-25M35m8dthA.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*x3ng5rwENj-25M35m8dthA.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*x3ng5rwENj-25M35m8dthA.gif 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1168\/1*x3ng5rwENj-25M35m8dthA.gif 1168w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 584px\" data-testid=\"og\" \/><img loading=\"lazy\" decoding=\"async\" class=\"bg kq ol c\" role=\"presentation\" src=\"https:\/\/i0.wp.com\/miro.medium.com\/v2\/resize:fit:1168\/1*x3ng5rwENj-25M35m8dthA.gif?resize=584%2C612&#038;ssl=1\" alt=\"\" width=\"584\" height=\"612\" data-recalc-dims=\"1\" 
\/><\/picture><\/div><figcaption class=\"om on oo ny nz op oq be b bf z hb\" data-selectable-paragraph=\"\"><em>Example of the privacy-preserving bot in action.<\/em><\/figcaption><\/figure>\n<h1 id=\"5d94\" class=\"mh mi ev be mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne bj\" data-selectable-paragraph=\"\">Tying it all together<\/h1>\n<p id=\"8240\" class=\"pw-post-body-paragraph lj lk ev ll b lm nf lo lp lq ng ls lt lu nh lw lx ly ni ma mb mc nj me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">In this article, we walked through the general steps to construct a privacy-preserving LLM-based chatbot. With organizations increasingly using LLM-based applications in their businesses and operations, the need to preserve data privacy has become acute. Concerns about protecting the privacy and security of sensitive data are the biggest adoption blocker that prevents many companies from making full use of AI with their datasets.<\/p>\n<p id=\"9b43\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Solving this problem requires identifying the key points where sensitive data might enter your system and need to be de-identified. When working with LLMs, those points occur during model training (whether you are building an LLM or customizing one) and at the user input stage. You can use Skyflow Data Privacy Vault to implement effective de-identification and data governance for LLM-based AI tools like chatbots.<\/p>\n<p id=\"bb2b\" class=\"pw-post-body-paragraph lj lk ev ll b lm ln lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg eo bj\" style=\"text-align: justify;\" data-selectable-paragraph=\"\">Building an LLM-based chatbot requires the use of several tools to ensure that data is handled in a manner that preserves privacy. 
Taking privacy-preserving measures is critical to prevent the misuse or exposure of sensitive information. By using the tools and methods we\u2019ve demonstrated here, companies can leverage AI\u2019s benefits and promote efficient data-driven decision-making while prioritizing data privacy and protection.<\/p>\n<div style=\"text-align: justify;\">\n<div><span style=\"font-weight: 400;\"><a href=\"https:\/\/twitter.com\/seanfalconer\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"15613\" data-permalink=\"https:\/\/softwareengineeringdaily.com\/2023\/10\/24\/streamlit-with-amanda-kelly\/rectangle-3-3\/\" data-orig-file=\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/10\/Rectangle-3-2.png?fit=218%2C258&amp;ssl=1\" data-orig-size=\"218,258\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Rectangle 3\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/10\/Rectangle-3-2.png?fit=218%2C258&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/10\/Rectangle-3-2.png?fit=218%2C258&amp;ssl=1\" class=\"size-full wp-image-15613 alignleft\" src=\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/10\/Rectangle-3-2.png?resize=218%2C258&#038;ssl=1\" alt=\"\" width=\"218\" height=\"258\" data-recalc-dims=\"1\" \/><\/a>Sean&#8217;s been an academic, startup founder, and Googler. 
He has published works covering a wide range of topics from information visualization to quantum computing. Currently, Sean is Head of Marketing and Developer Relations at <a href=\"https:\/\/www.skyflow.com\/\">Skyflow<\/a> and host of the podcast Partially Redacted, a podcast about privacy and security engineering. You can connect with Sean on Twitter <a href=\"https:\/\/twitter.com\/seanfalconer\">@seanfalconer.<\/a><\/span><\/div>\n<div><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>As Large Language Models (LLMs) and generative AI continue to grow more sophisticated and available, many organizations are starting to build, fine-tune, and customize LLMs based on their internal data and documents. This can bring incredible efficiency and reliability to data-driven decision-making processes. However, this practice comes with its share of challenges, primarily around data<\/p>\n","protected":false},"author":84,"featured_media":15873,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[1363,83,2143,1080],"tags":[311],"class_list":["post-15857","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-all-episodes","category-articles","category-exclusive-content","category-machine-learning","tag-machine-learning"],"jetpack
_publicize_connections":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.8 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Building a Privacy-Preserving LLM-Based Chatbot - Software Engineering Daily<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building a Privacy-Preserving LLM-Based Chatbot - Software Engineering Daily\" \/>\n<meta property=\"og:description\" content=\"As Large Language Models (LLMs) and generative AI continue to grow more sophisticated and available, many organizations are starting to build, fine-tune, and customize LLMs based on their internal data and documents. This can bring incredible efficiency and reliability to data-driven decision-making processes. 
However, this practice comes with its share of challenges, primarily around data\" \/>\n<meta property=\"og:url\" content=\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\" \/>\n<meta property=\"og:site_name\" content=\"Software Engineering Daily\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-27T10:00:07+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-11-21T02:04:32+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"600\" \/>\n\t<meta property=\"og:image:height\" content=\"315\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Sean Falconer\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@software_daily\" \/>\n<meta name=\"twitter:site\" content=\"@software_daily\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Sean Falconer\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\"},\"author\":{\"name\":\"Sean Falconer\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/#\/schema\/person\/5c9cd686476cc9a9bbfd344ef3da5e31\"},\"headline\":\"Building a Privacy-Preserving LLM-Based Chatbot\",\"datePublished\":\"2023-11-27T10:00:07+00:00\",\"dateModified\":\"2023-11-21T02:04:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\"},\"wordCount\":2236,\"publisher\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1\",\"keywords\":[\"Machine Learning\"],\"articleSection\":[\"All Content\",\"Exclusive Articles\",\"Exclusive Content\",\"Machine Learning\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\",\"url\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\",\"name\":\"Building a Privacy-Preserving LLM-Based Chatbot - Software Engineering 
Daily\",\"isPartOf\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1\",\"datePublished\":\"2023-11-27T10:00:07+00:00\",\"dateModified\":\"2023-11-21T02:04:32+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage\",\"url\":\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1\",\"width\":600,\"height\":315},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/softwareengineeringdaily.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Building a Privacy-Preserving LLM-Based Chatbot\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/#website\",\"url\":\"https:\/\/softwareengineeringdaily.com\/\",\"name\":\"Software Engineering 
Daily\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/softwareengineeringdaily.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/#organization\",\"name\":\"Software Engineering Daily\",\"url\":\"https:\/\/softwareengineeringdaily.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2024\/01\/cropped-sed_website_banner.png?fit=549%2C169&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2024\/01\/cropped-sed_website_banner.png?fit=549%2C169&ssl=1\",\"width\":549,\"height\":169,\"caption\":\"Software Engineering Daily\"},\"image\":{\"@id\":\"https:\/\/softwareengineeringdaily.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/software_daily\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/#\/schema\/person\/5c9cd686476cc9a9bbfd344ef3da5e31\",\"name\":\"Sean Falconer\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/softwareengineeringdaily.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a445af505350d4ba2cc720afd542fbc4?s=96&d=retro&r=pg\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a445af505350d4ba2cc720afd542fbc4?s=96&d=retro&r=pg\",\"caption\":\"Sean Falconer\"},\"sameAs\":[\"https:\/\/skyflow.com\/\"],\"url\":\"https:\/\/softwareengineeringdaily.com\/author\/seanfalconer\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Building a Privacy-Preserving LLM-Based Chatbot - Software Engineering Daily","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/","og_locale":"en_US","og_type":"article","og_title":"Building a Privacy-Preserving LLM-Based Chatbot - Software Engineering Daily","og_description":"As Large Language Models (LLMs) and generative AI continue to grow more sophisticated and available, many organizations are starting to build, fine-tune, and customize LLMs based on their internal data and documents. This can bring incredible efficiency and reliability to data-driven decision-making processes. However, this practice comes with its share of challenges, primarily around data","og_url":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/","og_site_name":"Software Engineering Daily","article_published_time":"2023-11-27T10:00:07+00:00","article_modified_time":"2023-11-21T02:04:32+00:00","og_image":[{"width":600,"height":315,"url":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1","type":"image\/png"}],"author":"Sean Falconer","twitter_card":"summary_large_image","twitter_creator":"@software_daily","twitter_site":"@software_daily","twitter_misc":{"Written by":"Sean Falconer","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#article","isPartOf":{"@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/"},"author":{"name":"Sean Falconer","@id":"https:\/\/softwareengineeringdaily.com\/#\/schema\/person\/5c9cd686476cc9a9bbfd344ef3da5e31"},"headline":"Building a Privacy-Preserving LLM-Based Chatbot","datePublished":"2023-11-27T10:00:07+00:00","dateModified":"2023-11-21T02:04:32+00:00","mainEntityOfPage":{"@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/"},"wordCount":2236,"publisher":{"@id":"https:\/\/softwareengineeringdaily.com\/#organization"},"image":{"@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1","keywords":["Machine Learning"],"articleSection":["All Content","Exclusive Articles","Exclusive Content","Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/","url":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/","name":"Building a Privacy-Preserving LLM-Based Chatbot - Software Engineering 
Daily","isPartOf":{"@id":"https:\/\/softwareengineeringdaily.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage"},"image":{"@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1","datePublished":"2023-11-27T10:00:07+00:00","dateModified":"2023-11-21T02:04:32+00:00","breadcrumb":{"@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#primaryimage","url":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1","contentUrl":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1","width":600,"height":315},{"@type":"BreadcrumbList","@id":"https:\/\/softwareengineeringdaily.com\/2023\/11\/27\/building-a-privacy-preserving-llm-based-chatbot\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/softwareengineeringdaily.com\/"},{"@type":"ListItem","position":2,"name":"Building a Privacy-Preserving LLM-Based Chatbot"}]},{"@type":"WebSite","@id":"https:\/\/softwareengineeringdaily.com\/#website","url":"https:\/\/softwareengineeringdaily.com\/","name":"Software Engineering 
Daily","description":"","publisher":{"@id":"https:\/\/softwareengineeringdaily.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/softwareengineeringdaily.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/softwareengineeringdaily.com\/#organization","name":"Software Engineering Daily","url":"https:\/\/softwareengineeringdaily.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/softwareengineeringdaily.com\/#\/schema\/logo\/image\/","url":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2024\/01\/cropped-sed_website_banner.png?fit=549%2C169&ssl=1","contentUrl":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2024\/01\/cropped-sed_website_banner.png?fit=549%2C169&ssl=1","width":549,"height":169,"caption":"Software Engineering Daily"},"image":{"@id":"https:\/\/softwareengineeringdaily.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/software_daily"]},{"@type":"Person","@id":"https:\/\/softwareengineeringdaily.com\/#\/schema\/person\/5c9cd686476cc9a9bbfd344ef3da5e31","name":"Sean Falconer","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/softwareengineeringdaily.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/a445af505350d4ba2cc720afd542fbc4?s=96&d=retro&r=pg","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a445af505350d4ba2cc720afd542fbc4?s=96&d=retro&r=pg","caption":"Sean 
Falconer"},"sameAs":["https:\/\/skyflow.com\/"],"url":"https:\/\/softwareengineeringdaily.com\/author\/seanfalconer\/"}]}},"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"https:\/\/i0.wp.com\/softwareengineeringdaily.com\/wp-content\/uploads\/2023\/11\/FI-LLM.png?fit=600%2C315&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p7GuoD-47L","_links":{"self":[{"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/posts\/15857"}],"collection":[{"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/users\/84"}],"replies":[{"embeddable":true,"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/comments?post=15857"}],"version-history":[{"count":0,"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/posts\/15857\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/media\/15873"}],"wp:attachment":[{"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/media?parent=15857"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/categories?post=15857"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/softwareengineeringdaily.com\/wp-json\/wp\/v2\/tags?post=15857"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}