Uploading Files to your AI Chatbot

We've always had the ability to upload files to your cloudlet, which allows your AI chatbot to use information found in those files as it answers questions.

However, we recently changed the flow for uploading files, and we haven't updated the documentation very well, so I'll write a small article here explaining the logic.

File types we support

In addition to website scraping, we support a whole range of file types, although some files might need additional customization to work. Out of the box we support the following file types.

  • PDF files
  • XML files
  • JSON files
  • CSV files
  • YAML files

These are divided into two types during import: structured data and unstructured data. To import files, go to "Manage/Machine Learning", choose your type, and click "Import". Then choose between structured and unstructured file types. PDF is the only unstructured file type we currently have, and all other file types are considered structured. Below is a screenshot.

[Screenshot: Importing files to your AI chatbot]

How to import XML, JSON, CSV and YAML files

XML, JSON, CSV, and YAML are considered "structured data" because we can accept data that's structured into two columns. When you import structured data, the import logic assumes your file has two columns, one for the prompt and one for the completion. If you've got an XML file, for instance, it should resemble the following.

<data>
  <entry>
    <prompt>Some prompt here</prompt>
    <completion>Some completion here</completion>
  </entry>
  <entry>
    <prompt>Another prompt here</prompt>
    <completion>Another completion here</completion>
  </entry>
</data>

The point is that the root level of your XML document holds an array of items, where each item contains a prompt part and a completion part. If you have another structure, the import function can still handle it, as long as you provide the correct name for your "prompt" part and the correct name for your "completion" part, and the document contains an array of items at the root. What your root node is called is irrelevant.
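
To make the expected shape concrete, here is a minimal Python sketch of how a file like this could be read. It is only an illustration under my own assumptions, not the actual import code: the function name `read_xml_records`, its parameters, and the sample element names `faq`, `row`, `question`, and `answer` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def read_xml_records(xml_text, prompt_name="prompt", completion_name="completion"):
    """Return (prompt, completion) pairs from the items directly under the root."""
    root = ET.fromstring(xml_text)  # the root tag name is never inspected
    records = []
    for item in root:
        prompt = item.findtext(prompt_name)
        completion = item.findtext(completion_name)
        if prompt is None or completion is None:
            raise ValueError(
                f"<{item.tag}> lacks '{prompt_name}' or '{completion_name}'"
            )
        records.append((prompt.strip(), completion.strip()))
    return records


# Hypothetical file that uses different names for root, items, and columns.
SAMPLE = """<faq>
  <row>
    <question>What are your opening hours?</question>
    <answer>We are open 9 to 17 on weekdays.</answer>
  </row>
</faq>"""

pairs = read_xml_records(SAMPLE, prompt_name="question", completion_name="answer")
```

The sketch only looks at the children of the root node and at the two named parts inside each child, which mirrors the rule that the root name does not matter while the two column names do.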

Overriding the names of your "prompt" and "completion" columns is the purpose of the two textboxes in the import dialog. The logic is exactly the same for YAML, JSON, and CSV files. All structured imports handle exactly two columns, and if these columns do not have the right names, you must override them accordingly.
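
The same idea can be sketched for CSV and JSON using only the Python standard library. This is an assumption-laden illustration, not the real import logic: `pick_columns` and the column names `question` and `answer` are hypothetical, and YAML is left out only because it needs a third-party parser.

```python
import csv
import io
import json


def pick_columns(rows, prompt_name="prompt", completion_name="completion"):
    """Pick the two named columns out of each row (a dict) as a pair."""
    pairs = []
    for row in rows:
        if prompt_name not in row or completion_name not in row:
            raise KeyError(f"row lacks '{prompt_name}' or '{completion_name}'")
        pairs.append((row[prompt_name], row[completion_name]))
    return pairs


# The same record expressed as CSV and as JSON, with non-default column names.
CSV_TEXT = "question,answer\nDo you ship abroad?,Yes to most countries.\n"
JSON_TEXT = '[{"question": "Do you ship abroad?", "answer": "Yes to most countries."}]'

csv_pairs = pick_columns(csv.DictReader(io.StringIO(CSV_TEXT)), "question", "answer")
json_pairs = pick_columns(json.loads(JSON_TEXT), "question", "answer")
```

Both calls yield the same list of pairs, which is the point: once the two column names are known, the file format no longer matters.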

Structured import creates the best result, since it lets you import your data without losing anything and without adding any "cognitive noise". Cognitive noise reduces the quality of responses because it puts static and irrelevant data into your model.

How to import PDF files

PDF files, on the other hand, are considered "unstructured", which means the import function has to apply some "guesswork" to break them down into individual training snippets. This process typically doesn't produce results of the same quality, since the default logic is to create one training snippet for each page in the file. This is necessary to keep training snippets small, which in turn prevents token overflow and allows the chatbot to freely "associate". However, the process can be significantly improved by creating a "massage value".
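
The default one-snippet-per-page behavior can be pictured roughly like this. The sketch assumes the text of each page has already been extracted from the PDF by some other tool, and the function name `pages_to_snippets` and the "filename, page N" prompt format are my own hypothetical choices, not what the importer actually stores.

```python
def pages_to_snippets(page_texts, filename):
    """Create one prompt/completion snippet per non-empty page."""
    snippets = []
    for number, text in enumerate(page_texts, start=1):
        cleaned = text.strip()
        if not cleaned:
            continue  # blank pages carry no information, so skip them
        snippets.append(
            {"prompt": f"{filename}, page {number}", "completion": cleaned}
        )
    return snippets


# Hypothetical page texts; the second page is blank and is skipped.
pages = ["Welcome to our product manual.", "   ", "Pricing starts at 10 EUR."]
snippets = pages_to_snippets(pages, "manual.pdf")
```

Splitting by page keeps every snippet small, but a page boundary can still cut a topic in half, which is one reason the result is typically lower quality than a structured import.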

[Screenshot: Importing PDF files to your AI chatbot]

A "massage value" is a prompt that's used as an instruction to OpenAI, asking it to extract the important information and create a structure out of it, summarize it, or both.

This significantly increases your training data quality, since it removes static and noise typically found in PDF files. Noise is content that doesn't add any value to your data, such as design artefacts, redundant spaces, or a total lack of white space. ChatGPT will in general handle this well, and return something that's of much higher quality than your starting point. However, this process takes a lot of time, since it needs to invoke OpenAI once for every single training snippet it imports. If you're uploading hundreds of files, this might take hours.
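
The cost of this step is easier to see in code. Below is a rough sketch, assuming each snippet is a dictionary with `prompt` and `completion` keys; `massage_snippets` is a hypothetical name, and the model call is passed in as a plain function so that the example runs without any API key. The stand-in `counting_complete` only collapses whitespace, where a real import would call OpenAI.

```python
from typing import Callable


def massage_snippets(snippets, massage_value, complete: Callable[[str, str], str]):
    """Rewrite each snippet's completion using the massage value as instruction."""
    massaged = []
    for snippet in snippets:
        # One model invocation per snippet, which is why large imports are slow.
        cleaned = complete(massage_value, snippet["completion"])
        massaged.append({"prompt": snippet["prompt"], "completion": cleaned})
    return massaged


calls = []


def counting_complete(instruction, text):
    """Stand-in for a real model call: records the call, collapses whitespace."""
    calls.append(instruction)
    return " ".join(text.split())


raw = [
    {"prompt": "manual.pdf, page 1", "completion": "Opening   hours:\n\n9 to 17"},
    {"prompt": "manual.pdf, page 2", "completion": "Pricing\t\tstarts at 10 EUR"},
]
result = massage_snippets(raw, "Extract the key facts as plain text.", counting_complete)
```

Two snippets lead to two calls, so a few hundred PDF files with many pages each quickly add up to thousands of sequential requests.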

Downloading backups of your training data

Every now and then you should download a backup of your training data. This ensures you've got a backup if you accidentally delete training snippets. You can do this by choosing your training snippets tab, selecting the right type, and clicking "Export". This will download a CSV file containing all training snippets you've currently filtered on.

This CSV file is in the "prompt/completion" format, so it can easily be imported back again later, without changing your default settings during import.
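
As an illustration of why a two-column "prompt/completion" CSV file round-trips cleanly, here is a small Python sketch. The helper names `export_snippets` and `import_snippets` are hypothetical, and the exact layout of the exported file is an assumption based on the description above.

```python
import csv
import io


def export_snippets(snippets):
    """Serialize snippets to CSV text with a prompt,completion header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["prompt", "completion"])
    writer.writeheader()
    writer.writerows(snippets)
    return buffer.getvalue()


def import_snippets(csv_text):
    """Parse CSV text back into a list of prompt/completion dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


original = [
    {"prompt": "Do you ship abroad?", "completion": 'Yes, to most "EU" countries.'},
    {"prompt": "Opening hours", "completion": "9 to 17 on weekdays"},
]
backup = export_snippets(original)
restored = import_snippets(backup)
```

Commas and quotes inside a snippet are escaped by the CSV writer and unescaped by the reader, so the restored list matches the original.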

API integrations

In addition to the above, we also provide API integrations, allowing us to extract training data from, for instance, Shopify, your CRM system, or your ticket system. In general, having access to highly structured data significantly increases quality compared to scraping websites.

Although we're super proud of our website scraper, and we believe we've got one of the strongest website scrapers in the industry, it's always an advantage to ingest structured data.

If you've got data in a completely custom format, we can also deal with it, but this requires manual work. Such manual work is an integrated part of our service when we onboard a new client though, so from your perspective it's not an issue.

Thomas Hansen

I am the CTO of AINIRO.IO AS. I am a software developer with more than 25 years of experience. I write about Machine Learning, AI, and how to help organizations adopt said technologies. You can follow me on LinkedIn if you want to read more of what I write.

Published 9. Feb 2024