Dataset

The dataset is, simply put, the set of data that a model works with. It's commonly split into two parts: training and testing. If you're familiar with statistical models (e.g. kNN), it's essentially the same idea.

First, you need to go out and collect a bunch of data from somewhere. The case study specifically distinguishes between real data and synthetic data.

  • Real data is scraped from real interactions. For instance, if there is currently a human support team (which the chatbot will replace), you can acquire tons of real data by logging the conversations between customers and support staff. It's "real" because it depicts genuine interactions, and as such offers a wide range of examples for the AI to train on.
  • Synthetic data, on the other hand, is made up: essentially it's two people in an office pretending to be a customer and a support agent having a conversation. While synthetic data has the benefit that you can produce far more of it (since it doesn't require real customers, who are comparatively "scarce"), the drawback is exactly that it's synthetic -- it covers a limited range of interactions, which can cause the AI to get stumped or otherwise behave strangely when it encounters something it has never seen before.
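To make the synthetic-data idea concrete, here is a minimal sketch of how such data might be generated programmatically instead of by two people role-playing. The issue templates and replies are hypothetical placeholders invented for illustration, not from the case study:

```python
import random

# Hypothetical templates; a real project would use far richer ones.
ISSUES = [
    "my order hasn't arrived",
    "I was charged twice",
    "the app keeps crashing",
]
REPLIES = [
    "I'm sorry to hear that. Could you share your order number?",
    "Thanks for flagging this. Let me look into your account.",
]

def make_synthetic_pair(rng: random.Random) -> dict:
    """Generate one fake customer/support exchange from the templates."""
    return {
        "customer": f"Hi, {rng.choice(ISSUES)}.",
        "support": rng.choice(REPLIES),
    }

rng = random.Random(0)  # seeded for reproducibility
dataset = [make_synthetic_pair(rng) for _ in range(1000)]
```

Note how easy it is to churn out a thousand examples -- and also how narrow they are: every conversation is a recombination of the same few templates, which is exactly the limited-range drawback described above.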

Training

An important principle to understand with anything related to AI is that its output is entirely dependent on the quality of its input. In other words: garbage in, garbage out. This means that the training dataset has to be diverse and unbiased, in order to provide a wide range of information for the chatbot to pull its answers from.

Usually, when collecting data, it's split between "training" and "testing" datasets (say, 80% training, 20% testing). The reason for this is to verify how the model responds to example data that specifically wasn't in the training dataset (the "testing" dataset is not used during training). This ensures that the model doesn't simply memorise and repeat parts of its training dataset verbatim -- a real risk when the model's capacity is large enough to effectively fit the entire dataset (this failure mode is known as overfitting).
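The 80/20 split described above can be sketched in a few lines. The shuffle-then-slice approach and the 0.2 test fraction are just one common convention, not something prescribed by the case study:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a dataset and split it into training and testing parts.

    The testing slice is held back and never shown to the model
    during training.
    """
    items = list(data)
    random.Random(seed).shuffle(items)  # shuffle to avoid ordering bias
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

examples = list(range(100))  # stand-in for 100 labelled conversations
train, test = train_test_split(examples)
print(len(train), len(test))  # 80 20
```

Shuffling before slicing matters: if the data is ordered (say, by date), taking the last 20% without shuffling could give the test set a systematically different flavour from the training set.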

Testing

The testing part of a dataset is a small portion held back to verify that the model works correctly and to measure its accuracy on never-before-seen data. This data isn't used for training; the model has no access to it until training is complete, at which point the whole chatbot is verified top-to-bottom against the testing data.
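Measuring accuracy on the held-out data amounts to counting how often the model's answer matches the expected one. The toy model and examples below are purely illustrative (the case study doesn't specify a metric), but the shape of the calculation is standard:

```python
def accuracy(model, test_set):
    """Fraction of held-out (question, expected) pairs the model gets right."""
    correct = sum(1 for question, expected in test_set
                  if model(question) == expected)
    return correct / len(test_set)

# A toy stand-in "model": classifies a string by the parity of its length.
toy_model = lambda q: len(q) % 2
test_set = [("hi", 0), ("hey", 1), ("hello", 1)]

print(accuracy(toy_model, test_set))  # 1.0
```

Because the test set was never seen during training, a high score here is evidence of generalisation rather than memorisation.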