Database

To persist the information specified in Use Cases, we need to design an appropriate database. This section discusses the structure used for each kind of information.

Because the project targets Android, SQLite is chosen as the platform's database. SQLite is a relational database with a programming interface in virtually every modern programming language, and it is natively supported on both Android and iOS. It is also suitable as a custom application file format; Anki, for example, uses SQLite for its flashcard deck files. SQLite supports databases of up to about 281 terabytes, which is more than enough for this use case.

Phonology and Phonetics

Set of consonants, vowels, tones

The table for this would likely have a mostly fixed number of rows.

  • IPA (TEXT): The International Phonetic Alphabet representation of the sound.
  • X_SAMPA (TEXT): The X-SAMPA equivalent, which allows the user to type the sound on a non-IPA keyboard.
  • is_used (BOOLEAN): Whether the sound is used in the language.
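
The columns above could be declared roughly as follows. This is a minimal sketch using Python's built-in `sqlite3` module; the table name `phoneme`, the `id` column, the constraints, and the helper `used_phonemes` are assumptions made for illustration and are not part of the design.

```python
import sqlite3

def create_phoneme_table(conn: sqlite3.Connection) -> None:
    # SQLite has no separate boolean storage class, so is_used holds 0 or 1.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS phoneme (
            id      INTEGER PRIMARY KEY,
            ipa     TEXT NOT NULL UNIQUE,
            x_sampa TEXT NOT NULL,
            is_used BOOLEAN NOT NULL DEFAULT 0 CHECK (is_used IN (0, 1))
        )
        """
    )

def used_phonemes(conn: sqlite3.Connection) -> list[str]:
    """Return the IPA symbols of the sounds marked as used in the language."""
    rows = conn.execute("SELECT ipa FROM phoneme WHERE is_used = 1 ORDER BY id")
    return [ipa for (ipa,) in rows]

conn = sqlite3.connect(":memory:")
create_phoneme_table(conn)
# A tiny sample of the (mostly fixed) sound inventory.
conn.executemany(
    "INSERT INTO phoneme (ipa, x_sampa, is_used) VALUES (?, ?, ?)",
    [("p", "p", 1), ("ʃ", "S", 1), ("ŋ", "N", 0)],
)
```

Because the rows are mostly fixed, the application would probably ship them pre-populated and only toggle `is_used`.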

Grammar

According to Chomsky, a grammar consists of:

  • vocabulary V
    • a subset T of V consisting of terminal symbols -- we call this word list
    • a subset N of V consisting of non-terminal symbols -- we call this word classes (e.g. nouns, adjectives, noun phrase)
  • a start symbol S -- this is usually a phrase or sentence
  • a finite set of productions P [1]

In this project, grammar only means the production rules.

Each production rule is characterized by the following columns:

  • name (TEXT, unique): The human-friendly identifier for the rule
  • transformation_syntax (INTEGER): ID of the transformation syntax
  • before (TEXT): The string to be transformed
  • after (TEXT): The string after the transformation
  • description (TEXT): Rule description to be added to the document

There are two transformation syntaxes (elaborated in the Implementation chapter):

  • RegEx
  • C-style string format

Each kind of production rule is represented in its own table and is described in the following subsections.
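
To make the two transformation syntaxes concrete, here is a hedged sketch of how a rule row might be applied to a word. The numeric IDs, the function `apply_rule`, and the way the C-style syntax uses `before` and `after` are assumptions; the actual semantics are left to the Implementation chapter.

```python
import re

# Hypothetical IDs for the transformation_syntax column.
SYNTAX_REGEX = 1
SYNTAX_C_FORMAT = 2

def apply_rule(syntax: int, before: str, after: str, word: str) -> str:
    """Apply one production rule to a word and return the transformed word."""
    if syntax == SYNTAX_REGEX:
        # `before` is the pattern to match, `after` is the replacement.
        return re.sub(before, after, word)
    if syntax == SYNTAX_C_FORMAT:
        # Assumption: `after` is a template with a single %s placeholder
        # and `before` is not used by this syntax.
        return after % (word,)
    raise ValueError(f"unknown transformation syntax: {syntax}")

plural = apply_rule(SYNTAX_REGEX, r"y$", "ies", "city")
negated = apply_rule(SYNTAX_C_FORMAT, "", "un%s", "happy")
```

The English examples are only placeholders; in practice the patterns would come from the rule tables of the constructed language.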

Inflection

Inflections are usually specific to a single part of speech, so they need a column for the part of speech.

Extra column:

  • part_of_speech (INTEGER): ID of the part of speech
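
A possible shape for the inflection table, sketched with Python's `sqlite3` module. The table names, the `id` columns, the `NOT NULL` constraints, and the helper `inflections_for` are illustrative assumptions. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(
    """
    CREATE TABLE part_of_speech (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- "before" and "after" are quoted because they are also SQL keywords.
    CREATE TABLE inflection (
        id                    INTEGER PRIMARY KEY,
        name                  TEXT NOT NULL UNIQUE,
        transformation_syntax INTEGER NOT NULL,
        "before"              TEXT NOT NULL,
        "after"               TEXT NOT NULL,
        description           TEXT,
        part_of_speech        INTEGER NOT NULL REFERENCES part_of_speech(id)
    );
    INSERT INTO part_of_speech (id, name) VALUES (1, 'noun'), (2, 'verb');
    INSERT INTO inflection
        (name, transformation_syntax, "before", "after", part_of_speech)
    VALUES ('plural', 1, '$', 's', 1);
    """
)

def inflections_for(conn: sqlite3.Connection, pos_name: str) -> list[str]:
    """Return the names of the inflection rules for one part of speech."""
    rows = conn.execute(
        """
        SELECT i.name
        FROM inflection AS i
        JOIN part_of_speech AS p ON p.id = i.part_of_speech
        WHERE p.name = ?
        ORDER BY i.name
        """,
        (pos_name,),
    )
    return [name for (name,) in rows]
```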

Phrase Syntax

TBD

Derivation

Derivation is like inflection, except that it usually changes the word's part of speech. By default, derivation rules are disabled.

Extra columns:

  • part_of_speech_before (INTEGER): ID of the part of speech the rule applies to
  • part_of_speech_after (INTEGER): ID of the part of speech the rule transforms the word to
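
The following sketch shows one way the derivation table could look. The column `is_enabled` is an assumption: the design says derivation rules are disabled by default but does not name a column for this. Table names and the helper `enabled_derivations` are likewise hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE part_of_speech (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE derivation (
        id                    INTEGER PRIMARY KEY,
        name                  TEXT NOT NULL UNIQUE,
        transformation_syntax INTEGER NOT NULL,
        "before"              TEXT NOT NULL,
        "after"               TEXT NOT NULL,
        description           TEXT,
        part_of_speech_before INTEGER NOT NULL REFERENCES part_of_speech(id),
        part_of_speech_after  INTEGER NOT NULL REFERENCES part_of_speech(id),
        -- Assumed flag: rules start disabled (0) until the user enables them.
        is_enabled            BOOLEAN NOT NULL DEFAULT 0
                              CHECK (is_enabled IN (0, 1))
    );
    INSERT INTO part_of_speech (id, name) VALUES (1, 'verb'), (2, 'noun');
    INSERT INTO derivation
        (name, transformation_syntax, "before", "after",
         part_of_speech_before, part_of_speech_after)
    VALUES ('agent noun', 1, '$', 'er', 1, 2);
    """
)

def enabled_derivations(conn: sqlite3.Connection) -> list[str]:
    """Return the names of derivation rules the user has switched on."""
    rows = conn.execute(
        "SELECT name FROM derivation WHERE is_enabled = 1 ORDER BY name"
    )
    return [name for (name,) in rows]

before_enabling = enabled_derivations(conn)
conn.execute("UPDATE derivation SET is_enabled = 1 WHERE name = ?", ("agent noun",))
after_enabling = enabled_derivations(conn)
```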

Writing system

Orthography rules

The table for orthography rules is divided into two categories: hard rules and soft rules. Hard rules are enforced by the program, which checks whether a newly added word or a body of text follows them. Soft rules are human-readable rules that are only exported into the document. Hard rules can be defined using RegEx, anti-RegEx (matching text that is disallowed), BNF, or EBNF.

Columns:

  • type (TEXT): regex, anti-regex, bnf, ebnf, soft
  • rule (TEXT)
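
As an illustration of how hard rules might be enforced, the sketch below checks a word against `regex` and `anti-regex` rows. It assumes that a `regex` rule must match the whole word and that an `anti-regex` rule must not match anywhere in it; both interpretations, the function names, and the sample rules are assumptions. BNF and EBNF rules would need a grammar parser and are not covered here.

```python
import re

def violates(rule_type: str, rule: str, word: str) -> bool:
    """Return True if `word` breaks the given orthography rule."""
    if rule_type == "regex":
        # Assumption: the pattern has to cover the whole word.
        return re.fullmatch(rule, word) is None
    if rule_type == "anti-regex":
        # Assumption: the pattern must not occur anywhere in the word.
        return re.search(rule, word) is not None
    if rule_type == "soft":
        # Soft rules are documentation only and are never enforced.
        return False
    raise NotImplementedError(f"no checker for rule type: {rule_type}")

def check_word(rules: list[tuple[str, str]], word: str) -> list[str]:
    """Return the rules (as stored in the `rule` column) that `word` breaks."""
    return [rule for rule_type, rule in rules if violates(rule_type, rule, word)]

# Sample rows as (type, rule) pairs.
rules = [
    ("regex", r"[a-z]+"),         # only lowercase Latin letters
    ("anti-regex", r"(.)\1\1"),   # no letter three times in a row
    ("soft", "Loanwords keep their original stress."),
]
```

For example, `check_word(rules, "kaaala")` reports the anti-regex rule, while `check_word(rules, "kala")` reports nothing.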

Scripts

Columns:

  • name (TEXT)
  • glyph (BLOB): the content of the vector file
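
A short sketch of storing a glyph as a BLOB with Python's `sqlite3` module. It assumes the vector file is SVG, which the design does not specify; the table name `script` and the sample glyph are also made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE script (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        glyph BLOB NOT NULL
    )
    """
)

# Raw bytes of a (hypothetical) vector file, stored unmodified.
svg = (
    b'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
    b'<path d="M1 1 L9 9"/></svg>'
)
conn.execute("INSERT INTO script (name, glyph) VALUES (?, ?)", ("ka", svg))

# Python bytes go in as a BLOB and come back out as bytes.
stored = conn.execute(
    "SELECT glyph FROM script WHERE name = ?", ("ka",)
).fetchone()[0]
```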

Vocabulary

Part of Speech & Word Class

Parts of speech and word classes share the same structure.

Columns:

  • name (TEXT)

Word List

Columns:

  • word (TEXT)
  • part_of_speech (INTEGER): part of speech ID
  • word_class (INTEGER): word class ID
  • definition (TEXT): Can be a translation into a natural language or a definition written in the conlang itself; this is up to the user.
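
Putting the vocabulary tables together, a minimal sketch could look like this. The table names, `id` columns, constraints, sample words, and the helper `words_by_part_of_speech` are assumptions; in particular, allowing `word_class` to be NULL is a guess, since the design does not say whether every word must belong to a word class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE part_of_speech (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE word_class (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE word (
        id             INTEGER PRIMARY KEY,
        word           TEXT NOT NULL,
        part_of_speech INTEGER NOT NULL REFERENCES part_of_speech(id),
        word_class     INTEGER REFERENCES word_class(id),
        definition     TEXT
    );
    INSERT INTO part_of_speech (id, name) VALUES (1, 'noun'), (2, 'verb');
    INSERT INTO word_class (id, name) VALUES (1, 'animate');
    INSERT INTO word (word, part_of_speech, word_class, definition)
    VALUES ('kala', 1, 1, 'fish'), ('moku', 2, NULL, 'to eat');
    """
)

def words_by_part_of_speech(
    conn: sqlite3.Connection, pos_name: str
) -> list[tuple[str, str]]:
    """Return (word, definition) pairs for one part of speech."""
    rows = conn.execute(
        """
        SELECT w.word, w.definition
        FROM word AS w
        JOIN part_of_speech AS p ON p.id = w.part_of_speech
        WHERE p.name = ?
        ORDER BY w.word
        """,
        (pos_name,),
    )
    return rows.fetchall()
```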

  1. Kenneth H. Rosen, Discrete Mathematics and Its Applications.