Custom entities

When you need to integrate industry-specific knowledge into your bot, you can use a CorpusExtractor or a RegexExtractor.

CorpusExtractor

Implementation

To implement your own corpus extractor you'll need two things:

  • A corpus extractor class
  • A corpus file

The corpus extractor class

This class extends the CorpusExtractor class:

const { CorpusExtractor, FileCorpus } = require('botfuel-dialog');

class DimensionNameExtractor extends CorpusExtractor {
  constructor() {
    super({
      dimension: 'dimension_name',
      corpus: new FileCorpus(`${__dirname}/../corpora/dimension_name.txt`),
      options: {
        caseSensitive: false,
        keepQuotes: false,
        keepDashes: false,
        keepAccents: false,
      },
    });
  }
}

There are two required parameters to subclass the CorpusExtractor:

  • dimension is the name of the extracted dimension
  • corpus is a FileCorpus instance, which turns the file at the given path into a matrix of words representing the corpus

The optional options parameter accepts four flags that control how the sentence is normalized before entities are extracted:

  • caseSensitive to keep capital letters
  • keepQuotes to keep quotes in the sentence
  • keepDashes to keep dashes in the sentence
  • keepAccents to keep accents in the sentence
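
To make the effect of these flags concrete, here is a minimal sketch of what such a normalization step could look like. The normalize() helper is hypothetical and not part of botfuel-dialog; in particular, replacing quotes and dashes with a space (rather than deleting them) is an assumption made for illustration.

```javascript
// Hypothetical helper, not the library's implementation: it only
// illustrates what each option is meant to preserve.
function normalize(sentence, options = {}) {
  const {
    caseSensitive = false,
    keepQuotes = false,
    keepDashes = false,
    keepAccents = false,
  } = options;
  let result = sentence;
  if (!caseSensitive) result = result.toLowerCase();
  // Assumption: quotes and dashes are replaced by a space.
  if (!keepQuotes) result = result.replace(/['"]/g, ' ');
  if (!keepDashes) result = result.replace(/-/g, ' ');
  if (!keepAccents) {
    // Decompose accented characters, then drop the combining marks.
    result = result.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  }
  return result;
}

console.log(normalize('Boîte-Auto')); // 'boite auto'
console.log(normalize('Boîte-Auto', {
  caseSensitive: true,
  keepDashes: true,
  keepAccents: true,
})); // 'Boîte-Auto'
```

With every flag set to false, as in the class above, "Boîte-Auto" and "boite auto" normalize to the same string, so a single corpus entry can cover both spellings.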

The corpus file

This file contains all the words and synonyms you want to be recognized for the defined entity. It is structured with one keyword and its synonyms per row, separated by commas.

For example, the transmission.txt looks like this:

automatic,auto
manual

This corpus has two keywords, automatic and manual, and automatic has one synonym, auto.
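
The row layout described above maps naturally onto a matrix of words. The parseCorpus() function below is a hypothetical sketch of that mapping, written only to illustrate the file format; it is not how FileCorpus is actually implemented.

```javascript
// Hypothetical parser: one row per line, first word is the keyword,
// the remaining words are its synonyms.
function parseCorpus(text) {
  return text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0) // skip blank lines
    .map(line => line.split(',').map(word => word.trim()));
}

const matrix = parseCorpus('automatic,auto\nmanual\n');
console.log(matrix); // [ [ 'automatic', 'auto' ], [ 'manual' ] ]
```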

Extracted entities format

Each entity extracted with the CorpusExtractor will be an object with the same format as entities extracted with the WsExtractor:

{
  dim: 'dimension_name',
  body: '<substring_extracted>',
  values: [{ value: '<entity_value>', type: 'string' }],
  start: '<substring_start_index>',
  end: '<substring_end_index>',
}

In most cases values will contain only one value for this extractor, but the array format is kept for compatibility with the other extractors.

Let's illustrate this with the TransmissionExtractor in the test-complexdialogs bot.

For example, if you say "Is this car available with an auto or manual transmission?" to the bot, this extractor will extract two transmission entities:

[
  {
    dim: 'transmission',
    body: 'auto',
    values: [{ value: 'automatic', type: 'string' }],
    start: 30,
    end: 34,
  },
  {
    dim: 'transmission',
    body: 'manual',
    values: [{ value: 'manual', type: 'string' }],
    start: 38,
    end: 44,
  },
];
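
To show how a synonym in the sentence maps back to its keyword, here is a rough sketch of corpus matching. The extractFromCorpus() and escapeRegExp() functions are hypothetical; the real CorpusExtractor may match differently (for instance, it normalizes the sentence first). The sketch assumes whole-word, case-insensitive matching and non-empty corpus words.

```javascript
// Escape characters that have a special meaning in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical matcher, not the library's implementation.
function extractFromCorpus(sentence, dimension, rows) {
  const entities = [];
  for (const row of rows) {
    const keyword = row[0]; // the first word of a row is the entity value
    for (const word of row) {
      const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, 'gi');
      let match;
      while ((match = pattern.exec(sentence)) !== null) {
        entities.push({
          dim: dimension,
          body: match[0],
          values: [{ value: keyword, type: 'string' }],
          start: match.index,
          end: match.index + match[0].length,
        });
      }
    }
  }
  // Return entities in the order they appear in the sentence.
  return entities.sort((a, b) => a.start - b.start);
}

const entities = extractFromCorpus(
  'Is this car available with an auto or manual transmission?',
  'transmission',
  [['automatic', 'auto'], ['manual']],
);
console.log(entities.map(e => [e.body, e.values[0].value, e.start, e.end]));
// [ [ 'auto', 'automatic', 30, 34 ], [ 'manual', 'manual', 38, 44 ] ]
```

Note that end is exclusive here: sentence.slice(start, end) returns body, which is consistent with the indices in the example above.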

If you would prefer a different format for the extracted entity values, you can override the buildValue() method in your corpus extractor class.

This method returns an object containing a value and its type:

/**
 * Builds the object value from a string.
 * @param {String} value - the string
 * @returns {Object} the object value
 */
buildValue(value) {
  return { value, type: 'string' };
}

For example, Botfuel Dialog comes with a BooleanExtractor class to extract boolean entities in answers. This extractor extends the CorpusExtractor and overrides the buildValue() method like this:

buildValue(value) {
  return { value: value === '1', type: 'boolean' };
}
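
The same idea extends to other value types. The sketch below assumes, as the comparison with '1' in the BooleanExtractor suggests, that buildValue() receives the keyword of the matched corpus row as a string. The corpus rows and the 'integer' type label are made up for illustration, and the function is shown standalone so that it can be run without botfuel-dialog.

```javascript
// Hypothetical corpus rows: the keyword is a digit, the synonyms are
// the ways a user might spell it out.
const rows = [['2', 'two', 'couple'], ['4', 'four']];

// Same signature as buildValue(): takes the keyword string and returns
// the value object stored in the entity's `values` array.
function buildValue(value) {
  return { value: Number.parseInt(value, 10), type: 'integer' };
}

console.log(rows.map(row => buildValue(row[0])));
// [ { value: 2, type: 'integer' }, { value: 4, type: 'integer' } ]
```

In a real extractor this function body would go into the buildValue() method of your CorpusExtractor subclass.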

RegexExtractor

Implementation

To implement a regex extractor, your extractor class needs to extend the RegexExtractor class.

For example, if you want to extract emails from user sentences, you can define a regex extractor like the following:

const { RegexExtractor } = require('botfuel-dialog');

const EMAIL_REGEX = /([a-zA-Z0-9._+-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/;

class EmailExtractor extends RegexExtractor {
  constructor() {
    super({
      dimension: 'email',
      regex: EMAIL_REGEX,
    });
  }
}

There are two required parameters to subclass the RegexExtractor:

  • dimension is the name of the extracted dimension
  • regex is the regex used to extract matching substrings in user sentences

Note that the global flag is automatically added to the regex so that every matching substring in the sentence is extracted, not just the first one.
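
The role of the global flag can be sketched as follows. The extractAll() helper is hypothetical and is not the library's code; it only shows how a regex without the g flag can be rebuilt with it and then applied repeatedly. The 'string' value type is assumed from the shared entity format.

```javascript
const EMAIL_REGEX = /([a-zA-Z0-9._+-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/;

// Hypothetical helper, not the library's implementation.
function extractAll(sentence, dimension, regex) {
  // Rebuild the regex with the global flag if it is missing.
  const flags = regex.flags.includes('g') ? regex.flags : `${regex.flags}g`;
  const globalRegex = new RegExp(regex.source, flags);
  const entities = [];
  let match;
  while ((match = globalRegex.exec(sentence)) !== null) {
    if (match[0] === '') {
      // Guard against infinite loops on zero-length matches.
      globalRegex.lastIndex += 1;
      continue;
    }
    entities.push({
      dim: dimension,
      body: match[0],
      values: [{ value: match[0], type: 'string' }],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return entities;
}

const found = extractAll('Write to a@b.co or c@d.org', 'email', EMAIL_REGEX);
console.log(found.map(e => [e.body, e.start, e.end]));
// [ [ 'a@b.co', 9, 15 ], [ 'c@d.org', 19, 26 ] ]
```

Without the global flag, exec() would always restart from the beginning of the sentence and only the first email could ever be returned.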

Extracted entities format

The format of extracted entities is the same as for the CorpusExtractor or the WsExtractor.