Getting deeper: Knowledge base (Training)

This documentation explains how to train your replicas using the Sensay API. Training is essential for creating personalized replicas that can provide accurate and relevant responses based on your specific content.

What is a knowledge base?

A knowledge base is a collection of information that your replica uses to answer questions. It's the foundation of your replica's ability to provide accurate and contextually relevant responses. All training in Sensay relies on knowledge base entries.

Knowledge base workflow

Knowledge base entries follow different processing paths depending on their type (text, file, website, or YouTube). Each entry progresses through a series of status stages as it's processed:

Processing stages

This diagram shows a high-level view of the system. For each specific entry type, refer to the more detailed state descriptions below.

---
displayMode: compact
---
stateDiagram-v2
    direction LR
    classDef badBadEvent font-family:"Consolas, monaco, monospace",fill:#f00,color:white,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef greenEvent font-family:"Consolas, monaco, monospace",fill:#0f0,color:black,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef ms font-family:"Consolas, monaco, monospace";

    [*] --> NEW:::ms

    NEW --> FILE_UPLOADED:::ms: File upload

    FILE_UPLOADED --> RAW_TEXT:::ms
    note left of RAW_TEXT
        Content extracted
    end note

    NEW --> RAW_TEXT:::ms: Crawler fetch

    RAW_TEXT --> PROCESSED_TEXT:::ms
    note left of PROCESSED_TEXT
        Content cleaned and optimized
    end note

    PROCESSED_TEXT --> VECTOR_CREATED:::greenEvent
    note left of VECTOR_CREATED
        Knowledge base updated and Replica is ready to use
    end note

    VECTOR_CREATED --> READY:::greenEvent
    note left of READY
        Knowledge base optimized and Replica is ready to use
    end note

    %% Error handling paths
    NEW --> UNPROCESSABLE:::badBadEvent
    FILE_UPLOADED --> UNPROCESSABLE:::badBadEvent
    RAW_TEXT --> UNPROCESSABLE:::badBadEvent
    PROCESSED_TEXT --> UNPROCESSABLE:::badBadEvent
    note left of UNPROCESSABLE
        The request cannot be handled and recovery is not possible
    end note


The processing pipeline moves entries through these stages automatically. A new entry typically takes about 5 minutes to go from NEW to VECTOR_CREATED, but larger files, complex websites, or peak hours can make this take longer. Moving from VECTOR_CREATED to READY can take up to 24 hours.


An entry will be marked as UNPROCESSABLE only if processing is fundamentally not possible (e.g., corrupted files, URLs requiring authorization, private YouTube videos). For temporary processing errors that might succeed on retry, the entry status remains unchanged but an error message is associated with the entry.
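
From a client's point of view, the statuses above fall into three groups: still in progress, usable, and permanently failed. The following is a minimal sketch of that grouping as a shell function. The function name kb_status_class and the labels it prints are illustrative and not part of the Sensay API.

```shell
# Hypothetical helper: classify a knowledge base entry status.
# Prints "trained", "failed", "pending", or "unknown".
kb_status_class() {
  case "$1" in
    # The replica can already use the content in either of these states.
    VECTOR_CREATED|READY) echo "trained" ;;
    # Terminal failure: retrying the same entry will not help.
    UNPROCESSABLE) echo "failed" ;;
    # Intermediate pipeline stages.
    NEW|FILE_UPLOADED|RAW_TEXT|PROCESSED_TEXT) echo "pending" ;;
    *) echo "unknown" ;;
  esac
}

kb_status_class "PROCESSED_TEXT"
```

Note that a temporary processing error leaves the status unchanged, so such an entry still counts as "pending" in this sketch.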

Processing paths by entry type

Different types of knowledge base entries follow different processing paths:

Text entries
stateDiagram-v2
    direction LR
    classDef badBadEvent font-family:"Consolas, monaco, monospace",fill:#f00,color:white,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef greenEvent font-family:"Consolas, monaco, monospace",fill:#0f0,color:black,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef ms font-family:"Consolas, monaco, monospace";

    [*] --> RAW_TEXT:::ms
    RAW_TEXT --> PROCESSED_TEXT:::ms
    note left of PROCESSED_TEXT
        Content cleaned and optimized
    end note

    PROCESSED_TEXT --> VECTOR_CREATED:::greenEvent
    note left of VECTOR_CREATED
        Knowledge base updated and Replica is ready to use
    end note

    VECTOR_CREATED --> READY:::greenEvent
    note left of READY
        Knowledge base optimized and Replica is ready to use
    end note

    %% Error handling paths
    RAW_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    PROCESSED_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    note right of UNPROCESSABLE
        The request cannot be handled and recovery is not possible
    end note
File entries
stateDiagram-v2
    direction LR
    classDef badBadEvent font-family:"Consolas, monaco, monospace",fill:#f00,color:white,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef greenEvent font-family:"Consolas, monaco, monospace",fill:#0f0,color:black,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef ms font-family:"Consolas, monaco, monospace";

    [*] --> NEW:::ms
    note right of NEW
        File waiting to be uploaded
    end note

    NEW --> FILE_UPLOADED:::ms: File upload

    FILE_UPLOADED --> RAW_TEXT:::ms
    note left of RAW_TEXT
        Text extracted from the file
    end note

    RAW_TEXT --> PROCESSED_TEXT:::ms
    note left of PROCESSED_TEXT
        Content cleaned and optimized
    end note

    PROCESSED_TEXT --> VECTOR_CREATED:::greenEvent
    note left of VECTOR_CREATED
        Knowledge base updated and Replica is ready to use
    end note

    VECTOR_CREATED --> READY:::greenEvent
    note left of READY
        Knowledge base optimized and Replica is ready to use
    end note

    %% Error handling paths
    NEW --> UNPROCESSABLE:::badBadEvent: Upload expired
    FILE_UPLOADED --> UNPROCESSABLE:::badBadEvent: Error (e.g. file is empty)
    RAW_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    PROCESSED_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    note right of UNPROCESSABLE
        The request cannot be handled and recovery is not possible
    end note
Website entries
stateDiagram-v2
    direction LR
    classDef badBadEvent font-family:"Consolas, monaco, monospace",fill:#f00,color:white,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef greenEvent font-family:"Consolas, monaco, monospace",fill:#0f0,color:black,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef ms font-family:"Consolas, monaco, monospace";

    [*] --> NEW:::ms
    note right of NEW
        Waiting for the crawler to fetch the content of the website
    end note

    NEW --> RAW_TEXT:::ms
    note left of RAW_TEXT
        Content extracted from the website
    end note

    RAW_TEXT --> PROCESSED_TEXT:::ms
    note left of PROCESSED_TEXT
        Content cleaned and optimized
    end note

    PROCESSED_TEXT --> VECTOR_CREATED:::greenEvent
    note left of VECTOR_CREATED
        Knowledge base updated and Replica is ready to use
    end note

    VECTOR_CREATED --> READY:::greenEvent
    note left of READY
        Knowledge base optimized and Replica is ready to use
    end note

    %% Error handling paths
    NEW --> UNPROCESSABLE:::badBadEvent: Error (e.g. URL cannot be accessed)
    RAW_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    PROCESSED_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    note right of UNPROCESSABLE
        The request cannot be handled and recovery is not possible
    end note
YouTube entries
stateDiagram-v2
    direction LR
    classDef badBadEvent font-family:"Consolas, monaco, monospace",fill:#f00,color:white,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef greenEvent font-family:"Consolas, monaco, monospace",fill:#0f0,color:black,font-weight:bold,stroke-width:2px,stroke:yellow
    classDef ms font-family:"Consolas, monaco, monospace";

    [*] --> NEW:::ms
    note right of NEW
        Waiting for the crawler to fetch the content of the video
    end note

    NEW --> RAW_TEXT:::ms
    note left of RAW_TEXT
        Content extracted from the video
    end note

    RAW_TEXT --> PROCESSED_TEXT:::ms
    note left of PROCESSED_TEXT
        Content cleaned and optimized
    end note

    PROCESSED_TEXT --> VECTOR_CREATED:::greenEvent
    note left of VECTOR_CREATED
        Knowledge base updated and Replica is ready to use
    end note

    VECTOR_CREATED --> READY:::greenEvent
    note left of READY
        Knowledge base optimized and Replica is ready to use
    end note

    %% Error handling paths
    NEW --> UNPROCESSABLE:::badBadEvent: Error (e.g. Video is private)
    RAW_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    PROCESSED_TEXT --> UNPROCESSABLE:::badBadEvent: Error
    note right of UNPROCESSABLE
        The request cannot be handled and recovery is not possible
    end note

Adding content to the knowledge base

There are four methods to add content to your replica's knowledge base:

Adding text content

Create a knowledge base entry with text content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json" \
 -d '{
   "text": "The way to the stars is written in starlight."
 }'

Example response:

{
  "success": true,
  "results": [
    {
      "type": "TEXT",
      "enqueued": true,
      "knowledgeBaseID": 12345
    }
  ]
}


Export the ID into a variable: export KNOWLEDGE_BASE_ID=

This creates a new knowledge base entry with your text content and automatically starts processing it.
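
One way to capture the ID without copying it by hand is to pull it out of the JSON response. The sketch below uses only sed so that it has no extra dependencies; extract_kb_id is a hypothetical helper name, and a real JSON tool such as jq would be more robust if it is available. The sample response is the one shown above.

```shell
# Hypothetical helper: read a JSON response on stdin and print the
# first knowledgeBaseID found in it.
extract_kb_id() {
  sed -n 's/.*"knowledgeBaseID"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sample response copied from the example above.
RESPONSE='{
  "success": true,
  "results": [
    {
      "type": "TEXT",
      "enqueued": true,
      "knowledgeBaseID": 12345
    }
  ]
}'

export KNOWLEDGE_BASE_ID=$(printf '%s\n' "$RESPONSE" | extract_kb_id)
echo "$KNOWLEDGE_BASE_ID"
```

In practice the curl command's output would be piped into extract_kb_id in place of the sample response.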

Wait for the content to be trained


You will need to wait for the status to be either VECTOR_CREATED or READY. You can check the training status via polling.

curl -X GET https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "success":true,
  "id":177030,
  "replicaUUID":"db2cc1de-cbe9-46bf-a428-1144145b7311",
  "type":"text",
  "status":"VECTOR_CREATED"
}
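
The polling described above can be sketched as a small loop. This is an illustration under assumptions, not an official client: the function names, the default of 30 attempts, and the 10 second interval are arbitrary choices, and the status is extracted with sed rather than a JSON parser. The functions are only defined here; nothing is sent until wait_until_trained is called.

```shell
# Hypothetical helper: fetch the entry and print its "status" field.
# Relies on REPLICA_UUID, KNOWLEDGE_BASE_ID and ORGANIZATION_SECRET being set.
fetch_status() {
  curl -s -X GET "https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID" \
    -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
    -H "Content-Type: application/json" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# Poll until the entry is trained (returns 0), UNPROCESSABLE (returns 1),
# or the attempts run out (returns 2).
# $1 = maximum attempts (default 30), $2 = seconds between attempts (default 10).
wait_until_trained() {
  max_attempts="${1:-30}"
  interval="${2:-10}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(fetch_status)"
    case "$status" in
      VECTOR_CREATED|READY) echo "$status"; return 0 ;;
      UNPROCESSABLE) echo "$status"; return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 2
}

# Usage: wait_until_trained 60 5 && echo "Entry is trained"
```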

You can now chat with the replica using the new content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/chat/completions \
 -H "Content-Type: application/json" \
 -H "X-API-Version: $API_VERSION" \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "X-USER-ID: $USER_ID" \
 -d '{"content":"What is the way to the stars written in?"}'

Example response:

{
  "success":true,
  "content":"The way to the stars is metaphorically \"written in starlight.\""
}

Uploading text-based files, documents or media files

  1. Create a knowledge base entry for file upload

Create a new knowledge base entry, making sure that the file extension matches the content of the file:

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json" \
 -d '{
   "filename": "your_file.txt"
 }'

Example response:

{
  "success": true,
  "results": [
    {
      "type": "FILE",
      "enqueued": true,
      "knowledgeBaseID": 12345,
      "signedURL": "https://storage.googleapis.com/..."
    }
  ]
}


Export the ID into a variable: export KNOWLEDGE_BASE_ID=


Export the Signed URL into a variable: export SIGNED_URL=

This creates a knowledge base entry for the file upload and returns a special URL where you can upload your file, along with the knowledge base ID for tracking. Files up to 50MB are supported. You can check the list of supported file types here.
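
Both values can be read from the response in one step. The sketch below defines two hypothetical sed-based helpers, json_number_field and json_string_field, and applies them to the sample response above. It assumes the signed URL contains no escaped characters; a JSON parser such as jq would handle escapes properly.

```shell
# Hypothetical helpers: print the first value of a numeric or string
# field ($1 = field name) from a JSON document on stdin.
json_number_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}
json_string_field() {
  sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# Sample response copied from the example above.
RESPONSE='{
  "success": true,
  "results": [
    {
      "type": "FILE",
      "enqueued": true,
      "knowledgeBaseID": 12345,
      "signedURL": "https://storage.googleapis.com/..."
    }
  ]
}'

export KNOWLEDGE_BASE_ID=$(printf '%s\n' "$RESPONSE" | json_number_field knowledgeBaseID)
export SIGNED_URL=$(printf '%s\n' "$RESPONSE" | json_string_field signedURL)
echo "$KNOWLEDGE_BASE_ID $SIGNED_URL"
```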

  2. Upload the file to the signed URL

Upload your file, making sure that the MIME type in the Content-Type header matches the content of the file:

echo "The way to earth is written in dust. The content of a plain file needs to be at least 50 characters." > your_file.txt
curl -X PUT "$SIGNED_URL" \
 -H "Content-Type: text/plain" \
 --data-binary @your_file.txt

You can check the list of supported MIME types here.
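
Before uploading, it can be worth checking the file locally against the two limits mentioned in this section: the 50MB maximum and the 50 character minimum for plain text. The following sketch makes several assumptions: 50MB is read as 50 * 1024 * 1024 bytes, only .txt files are treated as plain text, and the function name check_upload_file is hypothetical. The API remains the authority on what it accepts.

```shell
# Assumption: "50MB" means 50 * 1024 * 1024 bytes; the real limit may differ.
MAX_UPLOAD_BYTES=$((50 * 1024 * 1024))
# Minimum length for plain text files, as stated in the example above.
MIN_TEXT_CHARS=50

# Hypothetical helper: print "ok" and return 0 if the file passes the
# local checks, otherwise print the reason and return 1.
check_upload_file() {
  file="$1"
  if [ ! -f "$file" ]; then echo "missing file"; return 1; fi
  bytes=$(($(wc -c < "$file")))
  if [ "$bytes" -gt "$MAX_UPLOAD_BYTES" ]; then echo "too large"; return 1; fi
  case "$file" in
    *.txt)
      # wc -m counts characters, including the trailing newline if present.
      chars=$(($(wc -m < "$file")))
      if [ "$chars" -lt "$MIN_TEXT_CHARS" ]; then echo "too short"; return 1; fi
      ;;
  esac
  echo "ok"
}

# Usage: check_upload_file your_file.txt && curl -X PUT "$SIGNED_URL" ...
```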

Wait for the content to be trained


You will need to wait for the status to be either VECTOR_CREATED or READY. You can check the training status via polling.

curl -X GET https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "success":true,
  "id":177030,
  "replicaUUID":"db2cc1de-cbe9-46bf-a428-1144145b7311",
  "type":"text",
  "status":"VECTOR_CREATED"
}

You can now chat with the replica using the new content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/chat/completions \
 -H "Content-Type: application/json" \
 -H "X-API-Version: $API_VERSION" \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "X-USER-ID: $USER_ID" \
 -d '{"content":"What is the way to earth written in?"}'

Example response:

{
  "success":true,
  "content":"The way to Earth is metaphorically \"written in dust.\" This phrase serves as a foundational idea or metaphor, suggesting a profound or inherent truth about the path to Earth. It implies a journey of introspection and understanding, where the path unfolds as one connects with the core of their aspirations."
}

Adding website content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json" \
 -d '{
   "url": "https://en.wikipedia.org/wiki/National_Guard_of_Georgia",
   "autoRefresh": false
 }'


The autoRefresh parameter (optional, defaults to false) allows the system to automatically update the content when the source changes. The refresh interval is determined automatically and cannot be customized.
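
When the request body is built in a script, a small helper can keep the autoRefresh value restricted to a valid JSON boolean. This is only a sketch: website_entry_body is a hypothetical name, and the URL is inserted as-is, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: print a JSON body for a website entry.
# $1 = URL, $2 = autoRefresh ("true" or "false", default "false").
website_entry_body() {
  url="$1"
  auto_refresh="${2:-false}"
  case "$auto_refresh" in
    true|false) ;;
    *) echo "autoRefresh must be true or false" >&2; return 1 ;;
  esac
  printf '{"url": "%s", "autoRefresh": %s}\n' "$url" "$auto_refresh"
}

website_entry_body "https://en.wikipedia.org/wiki/National_Guard_of_Georgia" true
```

The output can then be passed to curl with -d "$(website_entry_body "$URL" true)".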


Export the ID into a variable: export KNOWLEDGE_BASE_ID=

Wait for the content to be trained


You will need to wait for the status to be either VECTOR_CREATED or READY. You can check the training status via polling.

curl -X GET https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "success":true,
  "id":177030,
  "replicaUUID":"db2cc1de-cbe9-46bf-a428-1144145b7311",
  "type":"text",
  "status":"VECTOR_CREATED"
}

You can now chat with the replica using the new content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/chat/completions \
 -H "Content-Type: application/json" \
 -H "X-API-Version: $API_VERSION" \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "X-USER-ID: $USER_ID" \
 -d '{"content":"What is the GNG?"}'

Example response:

{
  "success":true,
  "content":"The National Guard of Georgia (GNG) is a branch of the Defense Forces of Georgia, serving as a gendarmerie, guard of honour, and military reserve force. It was established on December 20, 1990, making it the first national military formation in then-Soviet Georgia. The GNG plays a multifaceted role, including responsibilities in civil affairs, internal security, natural disaster response, and support for military operations. It also has a significant historical role, having participated in major conflicts such as the Georgian Civil War and the Georgian-Ossetian and Georgian-Abkhaz conflicts in the early 1990s."
}

Adding YouTube content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json" \
 -d '{
   "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ&list=RDdQw4w9WgXcQ"
 }'

Supported YouTube URL formats:

  • Single videos: https://www.youtube.com/watch?v=VIDEO_ID
  • YouTube Shorts: https://www.youtube.com/shorts/SHORT_VIDEO_ID
  • Playlists: https://www.youtube.com/playlist?list=PLAYLIST_ID
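
To check a URL against the three formats listed above before submitting it, a pattern match is enough. The sketch below is a local convenience with a hypothetical function name; it recognizes only the formats listed here, and the API may accept other forms that this check reports as unsupported.

```shell
# Hypothetical helper: print "video", "short", "playlist", or "unsupported"
# depending on which of the documented URL formats $1 matches.
youtube_url_kind() {
  case "$1" in
    # \? matches a literal question mark; ?* requires a non-empty ID.
    https://www.youtube.com/watch\?v=?*) echo "video" ;;
    https://www.youtube.com/shorts/?*) echo "short" ;;
    https://www.youtube.com/playlist\?list=?*) echo "playlist" ;;
    *) echo "unsupported" ;;
  esac
}

youtube_url_kind "https://www.youtube.com/watch?v=dQw4w9WgXcQ&list=RDdQw4w9WgXcQ"
```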


Export the ID into a variable: export KNOWLEDGE_BASE_ID=

Wait for the content to be trained


You will need to wait for the status to be either VECTOR_CREATED or READY. You can check the training status via polling.

curl -X GET https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "success":true,
  "id":177030,
  "replicaUUID":"db2cc1de-cbe9-46bf-a428-1144145b7311",
  "type":"text",
  "status":"VECTOR_CREATED"
}

You can now chat with the replica using the new content

curl -X POST https://api.sensay.io/v1/replicas/$REPLICA_UUID/chat/completions \
 -H "Content-Type: application/json" \
 -H "X-API-Version: $API_VERSION" \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "X-USER-ID: $USER_ID" \
 -d '{"content":"What are you never gonna do?"}'

Example response:

{
  "success":true,
  "content":"There are several things that are promised to never be done, such as letting someone down, running around and deserting them, making them cry, saying goodbye, and telling a lie that would hurt them. These promises highlight a commitment to positive and supportive behavior."
}

Managing knowledge base entries

List all knowledge base entries

curl -X GET https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "success": true,
  "items": [
    {
      "id": 12345,
      "replicaUUID": "12345678-1234-1234-1234-123456789abc",
      "type": "TEXT",
      "status": "READY",
      "rawText": "Our company was founded in 2020...",
      "createdAt": "2025-04-15T08:11:00.093761+00:00",
      "updatedAt": "2025-04-15T08:11:05.299349+00:00",
      "title": "Company Information",
      "summary": "Basic company details and policies"
    }
  ],
  "total": 1
}

You can filter results using query parameters:

  • status: Filter by processing status (e.g., READY, VECTOR_CREATED)
  • type: Filter by entry type (e.g., TEXT, FILE, WEBSITE)

This endpoint also supports pagination.
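
As an illustration of the filters, the following sketch builds the list URL with optional status and type query parameters. The helper name kb_list_url is hypothetical, and values are appended without URL encoding, which is sufficient for the uppercase status and type names used in this document. Pagination parameters are left out because their names are not given here.

```shell
# Hypothetical helper: print the list URL for a replica.
# $1 = replica UUID, $2 = optional status filter, $3 = optional type filter.
kb_list_url() {
  base="https://api.sensay.io/v1/replicas/$1/knowledge-base"
  status_filter="${2:-}"
  type_filter="${3:-}"
  query=""
  if [ -n "$status_filter" ]; then
    query="status=$status_filter"
  fi
  if [ -n "$type_filter" ]; then
    if [ -n "$query" ]; then query="$query&"; fi
    query="${query}type=$type_filter"
  fi
  if [ -n "$query" ]; then
    echo "$base?$query"
  else
    echo "$base"
  fi
}

# Usage:
# curl -X GET "$(kb_list_url "$REPLICA_UUID" READY TEXT)" \
#  -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET"
```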

Get a specific knowledge base entry

curl -X GET https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "id": 12345,
  "replicaUUID": "12345678-1234-1234-1234-123456789abc",
  "type": "TEXT",
  "status": "READY",
  "rawText": "Your training text content...",
  "createdAt": "2025-04-15T08:11:00.093761+00:00",
  "updatedAt": "2025-04-15T08:11:05.299349+00:00",
  "title": "Company Information",
  "summary": "Basic company details including founding date, business focus, and operating hours."
}

Update a knowledge base entry

You may want to update a knowledge base entry when information becomes outdated, needs corrections, or requires expansion. Common scenarios include updating product information, revising company policies, or correcting errors in previously uploaded content.

When you update an entry's content (like rawText), the system will reprocess it through the processing pipeline only if the status changes. For example, updating rawText will set the status to RAW_TEXT and trigger reprocessing through the full pipeline (RAW_TEXT → PROCESSED_TEXT → VECTOR_CREATED → READY). This ensures your replica uses the most current information when responding to questions.

curl -X PATCH https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json" \
 -d '{
   "rawText": "Updated text content for your knowledge base entry.",
   "title": "Updated Title"
 }'

Example response:

{
  "success": true
}

Delete a knowledge base entry

curl -X DELETE https://api.sensay.io/v1/replicas/$REPLICA_UUID/knowledge-base/$KNOWLEDGE_BASE_ID \
 -H "X-ORGANIZATION-SECRET: $ORGANIZATION_SECRET" \
 -H "Content-Type: application/json"

Example response:

{
  "success": true
}