# Database
To persist the information specified in [Use Cases](uc.md),
we need to design an appropriate database.
This section describes the structure used for each kind of information.

Because the project targets Android, SQLite is chosen as the database.
SQLite is a relational database with a programming interface in
virtually every modern programming language. On top of that, it is natively
supported on both Android and iOS. It is also suitable as a custom
[application file format][app-format]; Anki, for example, uses SQLite for its
flashcard deck files. SQLite supports databases of up to [281 terabytes][size],
which is more than enough for this use case.

[app-format]: https://sqlite.org/appfileformat.html
[size]: https://sqlite.org/whentouse.html
## Phonology and Phonetics
### Set of Consonants, Vowels, and Tones
This table will likely have a mostly fixed number of rows.

Columns:
- IPA (`TEXT`): The International Phonetic Alphabet representation of the sound
- X_SAMPA (`TEXT`): The X-SAMPA equivalent,
  which lets the user type the sound on a non-IPA keyboard
- is_used (`BOOLEAN`): Whether the sound is used in the language
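
The columns above could be declared roughly as follows. This is only a sketch: the table name `phoneme`, the `id` key, and the constraints are assumptions, not part of the design. SQLite has no separate Boolean storage class, so `is_used` is stored as the integer 0 or 1.

```python
import sqlite3

# Hypothetical schema: only ipa, x_sampa and is_used come from the design;
# the table name, the id column and the constraints are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE phoneme (
        id      INTEGER PRIMARY KEY,
        ipa     TEXT NOT NULL UNIQUE,
        x_sampa TEXT NOT NULL,
        is_used BOOLEAN NOT NULL DEFAULT 0 CHECK (is_used IN (0, 1))
    )
    """
)
conn.executemany(
    "INSERT INTO phoneme (ipa, x_sampa, is_used) VALUES (?, ?, ?)",
    [("p", "p", True), ("ʃ", "S", True), ("ŋ", "N", False)],
)


def used_sounds(connection):
    """Return the IPA symbols that are marked as used in the language."""
    rows = connection.execute(
        "SELECT ipa FROM phoneme WHERE is_used = 1 ORDER BY id"
    )
    return [ipa for (ipa,) in rows]
```

Because the row set is mostly fixed, toggling `is_used` is the main write operation a user would trigger on this table.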
## Grammar
According to Chomsky, a grammar consists of:
- a vocabulary V
- a subset T of V consisting of terminal symbols
  (we call this the word list)
- a subset N of V consisting of non-terminal symbols
  (we call these word classes, e.g. nouns, adjectives, noun phrases)
- a start symbol S, which is usually a phrase or sentence
- a finite set of productions P [^1]

In this project, "grammar" refers only to the production rules.
Each production rule is characterized by the following columns:
- name (`TEXT`, unique): The human-friendly identifier for the rule
- transformation_syntax (`INTEGER`): The ID of the transformation syntax
- before (`TEXT`): The string to be transformed
- after (`TEXT`): The string after the transformation
- description (`TEXT`): The rule description to be added to the document

There are two transformation syntaxes (described in more detail in the
[Implementation](implementation.md) chapter):
- RegEx
- C-style string format

Each kind of production rule is represented in its own table
and is described in the following subsections.
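
To make the `before` and `after` columns more concrete, here is one possible way a rule row could be applied to a word. The exact semantics belong to the Implementation chapter, so everything here is an assumption: the numeric IDs, the function name `apply_rule`, the reading of `before` as a pattern and `after` as a replacement for RegEx rules, and the reading of `after` as a printf-style template for C-style rules. The sample words are invented.

```python
import re

# Hypothetical IDs for the two transformation syntaxes.
REGEX = 1
C_FORMAT = 2


def apply_rule(word, syntax, before, after):
    """Apply one production rule to a word (illustrative semantics only)."""
    if syntax == REGEX:
        # Assumption: `before` is a pattern and `after` is its replacement.
        return re.sub(before, after, word)
    if syntax == C_FORMAT:
        # Assumption: `after` is a printf-style template that receives the
        # word; `before` is unused in this sketch.
        return after % word
    raise ValueError(f"unknown transformation syntax: {syntax}")


# Invented sample rules: a suffix change and a prefix.
plural = apply_rule("kato", REGEX, r"o$", "oj")
prefixed = apply_rule("kato", C_FORMAT, "%s", "mal%s")
```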
### Inflection
An inflection usually applies to only one part of speech, so the table needs a
column that identifies it.

Extra column:
- part_of_speech (`INTEGER`): ID of the part of speech
### Phrase Syntax
TBD
### Derivation
Derivation is like inflection, except that it usually changes the word's part
of speech. Derivation rules are disabled by default.

Extra columns:
- part_of_speech_before (`INTEGER`): ID of the part of speech the rule applies to
- part_of_speech_after (`INTEGER`): ID of the part of speech the rule transforms the word into
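
A derivation table combining the common rule columns with the two extra ones might look like the sketch below. The table names, the `id` keys, the helper `add_rule`, and the sample rule are assumptions. In particular, the design says derivation rules are disabled by default but does not name a column for that, so `is_enabled` is a guess at how it could be modelled. `BEFORE` and `AFTER` are SQLite keywords, so the sketch quotes those column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE part_of_speech (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE derivation (
        id                    INTEGER PRIMARY KEY,
        name                  TEXT NOT NULL UNIQUE,
        transformation_syntax INTEGER NOT NULL,
        "before"              TEXT NOT NULL,
        "after"               TEXT NOT NULL,
        description           TEXT,
        part_of_speech_before INTEGER NOT NULL REFERENCES part_of_speech(id),
        part_of_speech_after  INTEGER NOT NULL REFERENCES part_of_speech(id),
        -- Assumption: a flag column models "disabled by default".
        is_enabled            BOOLEAN NOT NULL DEFAULT 0
    );
    INSERT INTO part_of_speech (id, name) VALUES (1, 'verb'), (2, 'noun');
    """
)


def add_rule(connection, name, syntax, before, after, pos_before, pos_after):
    """Insert a derivation rule; it stays disabled until explicitly enabled."""
    connection.execute(
        'INSERT INTO derivation (name, transformation_syntax, "before", "after", '
        "part_of_speech_before, part_of_speech_after) VALUES (?, ?, ?, ?, ?, ?)",
        (name, syntax, before, after, pos_before, pos_after),
    )


# Invented sample rule turning a verb into a noun.
add_rule(conn, "agent noun", 1, r"i$", "anto", 1, 2)
enabled = conn.execute(
    "SELECT is_enabled FROM derivation WHERE name = ?", ("agent noun",)
).fetchone()[0]
```

With foreign keys switched on, a rule that points at a non-existent part of speech is rejected with `sqlite3.IntegrityError` instead of being stored silently.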
## Writing System
### Orthography Rules
Orthography rules fall into two categories: hard rules and soft rules.
Hard rules are enforced by the program, which checks whether a newly added word
or a body of text follows them. Soft rules are human-readable rules that are
only exported into the document. Hard rules can be defined using RegEx,
anti-RegEx (a pattern matching text that is disallowed), BNF, or EBNF.

Columns:
- type (`TEXT`): one of `regex`, `anti-regex`, `bnf`, `ebnf`, or `soft`
- rule (`TEXT`): the rule text, interpreted according to `type`
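
As an illustration of how hard rules might be enforced, the sketch below checks a word against `regex` and `anti-regex` rows. It rests on assumptions: a `regex` rule must match the whole word, an `anti-regex` rule rejects the word if it matches anywhere, and `bnf`/`ebnf` rules are skipped because they would need a parser. The function name and the sample rules are invented.

```python
import re

# (type, rule) pairs as they might be read from the orthography table.
# The rules themselves are invented examples.
rules = [
    ("regex", r"[a-z]+"),             # only lowercase letters are allowed
    ("anti-regex", r"(.)\1\1"),       # no character three times in a row
    ("soft", "Prefer short words."),  # documentation only, never enforced
]


def violated_rules(word, rules):
    """Return the hard rules the word breaks (BNF and EBNF are not handled)."""
    broken = []
    for rule_type, rule in rules:
        if rule_type == "regex" and re.fullmatch(rule, word) is None:
            broken.append(rule)
        elif rule_type == "anti-regex" and re.search(rule, word) is not None:
            broken.append(rule)
    return broken
```

An empty result means the word passes every hard rule that this sketch understands; soft rules never appear in the result.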
### Scripts
Columns:
- name (`TEXT`)
- glyph (`BLOB`): the contents of the glyph's vector image file
## Vocabulary
### Part of Speech & Word Class
The part-of-speech table and the word-class table share the same structure.

Columns:
- name (`TEXT`)
### Word List
Columns:
- word (`TEXT`)
- part_of_speech (`INTEGER`): part of speech ID
- word_class (`INTEGER`): word class ID
- definition (`TEXT`): a translation into a natural language or a definition
  in the conlang itself, at the user's discretion
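
The two ID columns suggest that a word is resolved by joining against the part-of-speech and word-class tables. A minimal sketch of that lookup follows; the table names, the `id` keys, the helper `lookup`, and the sample data are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE part_of_speech (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE word_class     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE word (
        id             INTEGER PRIMARY KEY,
        word           TEXT NOT NULL,
        part_of_speech INTEGER REFERENCES part_of_speech(id),
        word_class     INTEGER REFERENCES word_class(id),
        definition     TEXT
    );
    -- Invented sample data.
    INSERT INTO part_of_speech (id, name) VALUES (1, 'noun');
    INSERT INTO word_class (id, name) VALUES (1, 'animate');
    INSERT INTO word (word, part_of_speech, word_class, definition)
    VALUES ('kato', 1, 1, 'cat');
    """
)


def lookup(connection, word):
    """Return (word, part of speech, word class, definition) rows for a word."""
    return connection.execute(
        """
        SELECT w.word, p.name, c.name, w.definition
        FROM word AS w
        LEFT JOIN part_of_speech AS p ON p.id = w.part_of_speech
        LEFT JOIN word_class AS c ON c.id = w.word_class
        WHERE w.word = ?
        """,
        (word,),
    ).fetchall()
```

The `LEFT JOIN`s keep a word in the result even when its part of speech or word class has not been filled in yet.
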
[^1]: Kenneth H. Rosen, *Discrete Mathematics and Its Applications*.