To prepare a training dataset for NLU text Bots, we can cluster the existing history of dialogues with users (intent mining). The dialogue history is divided into topics, and each topic contains examples of users' utterances and ready-made responses from the Operator. The users' utterances are later included in the training set of phrases for intents, and the Operator's responses are used when writing the Bot's responses in the Dialogue Scenario.
The data format for clustering is as follows:
The table should be in CSV format.
Each message (MESSAGE) should be placed in a separate row of the table.
Each dialogue is assigned a unique ID (DIALOG_ID), which must be a number.
Every message within a dialogue is labeled as belonging to either the customer or the Operator: MESSAGE_TYPE is 0 for a customer's message and 1 for an Operator's message.
Each message within a dialogue should be accompanied by the date and time it was sent (DIALOG_DT).
All quotation marks should be removed from within the message strings.
An example file with data in the correct format is available
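
The format rules above can be illustrated with a minimal Python sketch that writes messages to CSV and checks each row. The column names are taken from the description; the column order, the timestamp format, the set of quotation characters that get stripped (apostrophes are left in place), the helper names, and the sample messages are all assumptions made for illustration only, not part of the required format.

```python
import csv
import io
from datetime import datetime

# Column names come from the format description; their order and the
# timestamp format are assumptions made for this sketch.
COLUMNS = ["DIALOG_ID", "MESSAGE_TYPE", "DIALOG_DT", "MESSAGE"]
DT_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed; the description does not fix one
QUOTE_CHARS = '"«»“”„'  # assumed set; apostrophes are deliberately kept


def clean_message(text):
    """Remove quotation marks from a message and trim whitespace."""
    return text.translate(str.maketrans("", "", QUOTE_CHARS)).strip()


def validate_row(row):
    """Return a list of problems found in one CSV row (empty if valid)."""
    problems = []
    dialog_id = row["DIALOG_ID"]
    if not (dialog_id.isascii() and dialog_id.isdigit()):
        problems.append("DIALOG_ID must be a number")
    if row["MESSAGE_TYPE"] not in ("0", "1"):
        problems.append("MESSAGE_TYPE must be 0 (customer) or 1 (Operator)")
    try:
        datetime.strptime(row["DIALOG_DT"], DT_FORMAT)
    except ValueError:
        problems.append("DIALOG_DT does not match the assumed format")
    if any(ch in row["MESSAGE"] for ch in QUOTE_CHARS):
        problems.append("MESSAGE still contains quotation marks")
    return problems


def write_dialogues(messages, stream):
    """Write (dialog_id, message_type, sent_at, text) tuples, one row each."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for dialog_id, message_type, sent_at, text in messages:
        writer.writerow({
            "DIALOG_ID": dialog_id,
            "MESSAGE_TYPE": message_type,
            "DIALOG_DT": sent_at.strftime(DT_FORMAT),
            "MESSAGE": clean_message(text),
        })


# Made-up sample dialogue: one customer message and one Operator reply.
sample = [
    (1, 0, datetime(2023, 5, 1, 10, 0, 0), 'Hi, my "Premium" plan is not active'),
    (1, 1, datetime(2023, 5, 1, 10, 1, 30), "Hello! Let me check that for you."),
]
buffer = io.StringIO()
write_dialogues(sample, buffer)
buffer.seek(0)
for row in csv.DictReader(buffer):
    print(row["DIALOG_ID"], row["MESSAGE_TYPE"], validate_row(row))
```

Running the validator over an existing export before clustering gives a quick way to spot rows that break the rules, such as a non-numeric dialogue ID or a message type other than 0 or 1.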