Hello, this is ITOH (@takahi_i), a software engineer at ATL. I have released RedPen, a checker for documents written in natural languages. (Sorry for the late announcement; this is still a beta version.) RedPen targets natural-language documents such as manuals, essays, and e-mails. The project URL is below.

http://redpen.cc

This article introduces RedPen's features and shows how to use it.

Features of RedPen

RedPen generates a warning when it finds invalid expressions, overly complicated sentences, or obvious inconsistencies in a document. It currently provides only the following very primitive checks:

  • Length of sentences and paragraphs
  • Invalid expressions
  • Invalid symbols
  • Katakana end hyphen
  • Katakana spell check
  • Space before/after symbols

RedPen currently provides only these primitive features because our focus is not the sophisticated analysis studied in Natural Language Processing research. Instead, we want to detect obvious errors and formatting problems, much as a static analysis tool does for a programming language.

    Supported format

    RedPen currently supports Markdown, a subset of Textile tags, and plain text.

    Usage of RedPen

    Let’s get started with RedPen using the supplied configuration files and sample document. Please refer to the manual for more details on how to use it.

    Sample Setting

    RedPen has three kinds of configuration files. The first is the main configuration file, which specifies the other two configuration files and the document language. (The sample setting specifies Japanese, “ja”.)

    The other two files specified in the main configuration file are the Validator configuration file and the character configuration file.

    Validator configuration file

    In the Validator configuration file, you add a Validator for each item you want to check. Each Validator checks the input document from one perspective, for example, the length of sentences or the use of invalid symbols.

    The sample setting registers the following Validators:

  • Length of sentences (SentenceLength)
  • Invalid characters (InvalidCharacter)
  • Spaces (SpaceWithSymbol)
  • Katakana end hyphens (KatakanaEndHyphen)
  • Katakana spell check (KatakanaSpell)
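
To make the one-Validator-per-perspective idea concrete, here is a minimal Python sketch. It is not RedPen's implementation or API: the class names, the `max_length` parameter, and the message format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ValidationError:
    validator: str   # name of the Validator that raised the error
    line_no: int     # 1-based position of the offending sentence
    message: str


class SentenceLengthValidator:
    """Flags sentences longer than max_length characters (hypothetical default)."""

    def __init__(self, max_length: int = 30):
        self.max_length = max_length

    def validate(self, line_no: int, sentence: str) -> List[ValidationError]:
        if len(sentence) > self.max_length:
            return [ValidationError("SentenceLength", line_no,
                                    f"sentence has {len(sentence)} characters "
                                    f"(limit {self.max_length})")]
        return []


class InvalidSymbolValidator:
    """Flags any character from a caller-supplied set of forbidden symbols."""

    def __init__(self, invalid_symbols: str):
        self.invalid_symbols = set(invalid_symbols)

    def validate(self, line_no: int, sentence: str) -> List[ValidationError]:
        return [ValidationError("InvalidSymbol", line_no, f"invalid symbol {ch!r}")
                for ch in sentence if ch in self.invalid_symbols]


def check(sentences, validators):
    """Run every registered Validator over every sentence and collect the errors."""
    errors = []
    for line_no, sentence in enumerate(sentences, start=1):
        for validator in validators:
            errors.extend(validator.validate(line_no, sentence))
    return errors


if __name__ == "__main__":
    validators = [SentenceLengthValidator(max_length=20), InvalidSymbolValidator("!?")]
    for err in check(["Short one.", "This sentence is clearly too long!"], validators):
        print(f"{err.validator} (sentence {err.line_no}): {err.message}")
```

Because each Validator looks at only one aspect of a sentence, adding a new check in this sketch means adding one more object to the `validators` list, which mirrors how the Validator configuration file registers checks.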

    Character configuration file

    The default character set is determined by the language (the lang field) in the main configuration file. If you want settings different from the defaults of the specified language, you can override them with the character configuration file.

    In the character configuration file, you define the characters you use, the characters you must not use, and whether a white space is required before or after a character. I will not go into detail here, but the sample setting defines the full stop as “。” and the comma as “、”. invalid-char detects invalid characters that have the same symbolic meaning as a registered one. For example, the sample setting defines the comma as “、” and registers “,” as an invalid comma character.
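
The idea behind invalid-char can be sketched in a few lines of Python. This is only an illustration under assumptions, not RedPen's configuration format or code: the table layout, the key names (`COMMA`, `FULL_STOP`, `value`, `invalid_chars`), and the full-width alternatives “，” and “．” are hypothetical. Only the “、” / “,” pair and the “。” full stop come from the description above.

```python
# Hypothetical character table: each symbol has one accepted character and a
# string of look-alike characters that should be reported instead of accepted.
CHARACTER_TABLE = {
    "FULL_STOP": {"value": "。", "invalid_chars": "．"},   # "．" is an assumption
    "COMMA": {"value": "、", "invalid_chars": "，,"},      # "，" is an assumption
}


def find_invalid_chars(text, table=CHARACTER_TABLE):
    """Return (position, found_char, symbol_name, expected_char) for each hit."""
    # Build a reverse lookup from each invalid character to its symbol entry.
    reverse = {}
    for name, entry in table.items():
        for ch in entry["invalid_chars"]:
            reverse[ch] = (name, entry["value"])

    hits = []
    for pos, ch in enumerate(text):
        if ch in reverse:
            name, expected = reverse[ch]
            hits.append((pos, ch, name, expected))
    return hits


if __name__ == "__main__":
    for pos, found, name, expected in find_invalid_chars("これはペンです,たぶん。"):
        print(f"position {pos}: found {found!r}, expected {expected!r} for {name}")
```

Keeping the accepted character and its invalid equivalents in one entry lets the error message say which character to use instead, not just that something is wrong.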

    Sample document

    We use the sample document supplied in the RedPen package.

    Installation and running

    First of all, download RedPen.

    Installation of RedPen

    Download RedPen and then install it. (Git and Maven must be installed beforehand.)

    Running RedPen

    Then, run the installed RedPen.

    Execution result

    Running RedPen on the sample document reports several errors.

    Now, take a look at the result. Four errors (Validation Error) are reported. The first says that the first sentence of the input document is too long. The second says that the comma used differs from the registered one. The third says that the trailing long-vowel mark (“ー”) in the word “サーバー” is unnecessary. The fourth says that the document contains the Katakana word “インデクス”, whose spelling closely resembles that of another Katakana word, “インデックス”.
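
The last two errors can be approximated with the following Python sketch. It is my own illustration, not how RedPen implements these checks: the minimum word length of 4 for the trailing “ー” rule and the use of Levenshtein edit distance (with a threshold of 1) as the similarity measure are both assumptions, and all function names are hypothetical.

```python
import re

# A run of Katakana characters, including the long-vowel mark "ー".
KATAKANA_WORD = re.compile(r"[ァ-ヶー]+")


def trailing_long_vowel_words(text, min_length=4):
    """Katakana words of at least min_length characters that end with "ー"."""
    return [w for w in KATAKANA_WORD.findall(text)
            if len(w) >= min_length and w.endswith("ー")]


def edit_distance(a, b):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca with cb
        prev = curr
    return prev[-1]


def similar_katakana_pairs(text, max_distance=1):
    """Pairs of distinct Katakana words within max_distance edits of each other."""
    words = sorted(set(KATAKANA_WORD.findall(text)))
    return [(a, b) for i, a in enumerate(words) for b in words[i + 1:]
            if edit_distance(a, b) <= max_distance]


if __name__ == "__main__":
    sample = "サーバーにインデックスを作成する。インデクスの更新は不要。"
    print(trailing_long_vowel_words(sample))
    print(similar_katakana_pairs(sample))
```

Comparing the Katakana words of a document against each other, rather than against a dictionary, is what lets such a check run without any word list: two spellings one edit apart in the same document are likely variants of the same word.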

    Current status and future

    RedPen has been released as a beta version. The application currently works only as a standalone tool, and there will be many changes before the official release. The format of the configuration files (XML) will change considerably according to engineers’ requirements, and the Validator interface will also change as engineers make suggestions. I ask for your understanding regarding these inconveniences.

    What we are least satisfied with is how complicated the settings are when you start using the current version of RedPen. For example, to use the Validator that detects invalid expressions (InvalidExpressionValidator), you currently have to make the list of invalid expressions yourself. We want to solve this problem by providing default settings for each language and by offering Validators that work without any configuration or additional resources. As such configuration-free Validators, we are considering one that checks whether an abbreviation is accompanied by its formal name in the input document, and one that checks whether written and spoken styles of Japanese are mixed.
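
As a rough idea of what a configuration-free abbreviation check could look like, here is a small Python sketch. It does not describe a planned RedPen design. It assumes, purely for illustration, that an abbreviation is a run of two or more capital letters and that it counts as introduced if it appears somewhere in parentheses, as in “Natural Language Processing (NLP)”. The function name is hypothetical.

```python
import re


def undefined_abbreviations(text):
    """Abbreviations that are used but never introduced in parentheses."""
    # Every run of two or more capital letters is treated as an abbreviation.
    used = set(re.findall(r"\b[A-Z]{2,}\b", text))
    # An abbreviation counts as introduced if it appears as "(ABBR)".
    introduced = set(re.findall(r"\(([A-Z]{2,})\)", text))
    return sorted(used - introduced)


if __name__ == "__main__":
    doc = "Natural Language Processing (NLP) is fun. NLP is studied at ATL."
    print(undefined_abbreviations(doc))
```

A check like this needs no user-supplied resource, because everything it compares comes from the input document itself.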

    Another big improvement will be plugin support that makes it easy to add Validators. I will keep developing RedPen step by step, so I would appreciate your patience and support. Thank you.
