Database
To persist the information specified in the Use Cases, we need a database designed around it. This section discusses the structure for each kind of information.
Because the project targets Android, SQLite is chosen as the database. SQLite is a relational database with a programming interface in virtually every modern programming language. On top of that, it is natively supported on both Android and iOS. It is also suitable as a custom application file format; in fact, Anki uses SQLite for its flashcard deck files. SQLite supports databases up to about 281 terabytes, which is more than enough for this use case.
Phonology and Phonetics
Set of consonants, vowels, tones
The table for this would likely have a mostly fixed number of rows.
- IPA (TEXT): The International Phonetic Alphabet representation of the sound
- X_SAMPA (TEXT): The X-SAMPA equivalent, which allows the user to type the sound on a non-IPA keyboard
- is_used (BOOLEAN): Whether the sound is used in the language
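As a minimal sketch of this table, the columns above could be declared as follows. It uses Python's built-in `sqlite3` module purely for illustration; the app itself would go through the platform's SQLite API. The table name `phoneme`, the `id` column, and the `NOT NULL`, `UNIQUE`, and `CHECK` constraints are assumptions, not part of the design. SQLite has no dedicated boolean storage class, so `is_used` is stored as the integer 0 or 1.

```python
import sqlite3

# Hypothetical table name and constraints; the three data columns follow the list above.
PHONEME_SCHEMA = """
CREATE TABLE phoneme (
    id      INTEGER PRIMARY KEY,
    ipa     TEXT NOT NULL UNIQUE,
    x_sampa TEXT NOT NULL,
    is_used BOOLEAN NOT NULL DEFAULT 0 CHECK (is_used IN (0, 1))
);
"""


def create_phoneme_db() -> sqlite3.Connection:
    """Create an in-memory database pre-populated with a few sample sounds."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHONEME_SCHEMA)
    conn.executemany(
        "INSERT INTO phoneme (ipa, x_sampa, is_used) VALUES (?, ?, ?)",
        [("ʃ", "S", 1), ("ŋ", "N", 1), ("θ", "T", 0)],
    )
    conn.commit()
    return conn


def used_sounds(conn: sqlite3.Connection) -> list:
    """Return the IPA symbols of the sounds marked as used in the language."""
    rows = conn.execute("SELECT ipa FROM phoneme WHERE is_used = 1 ORDER BY id")
    return [row[0] for row in rows]


def ipa_from_x_sampa(conn: sqlite3.Connection, x_sampa: str):
    """Translate typed X-SAMPA into IPA, or return None if it is unknown."""
    row = conn.execute(
        "SELECT ipa FROM phoneme WHERE x_sampa = ?", (x_sampa,)
    ).fetchone()
    return row[0] if row else None
```

X-SAMPA is case-sensitive (`T` and `t` are different sounds), which matches SQLite's default binary comparison for TEXT.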
Grammar
According to Chomsky, a grammar consists of:
- vocabulary V
- a subset T of V consisting of terminal symbols -- we call this the word list
- a subset N of V consisting of non-terminal symbols -- we call these word classes (e.g. nouns, adjectives, noun phrases)
- a start symbol S -- this is usually a phrase or sentence
- a finite set of productions P [1]
In this project, grammar only means the production rules.
Each production rule is characterized by the following columns:
- name (TEXT, unique): The human-friendly identifier for the rule
- transformation_syntax (INTEGER): ID of the transformation syntax
- before (TEXT): The string to be transformed
- after (TEXT): The string after the transformation
- description (TEXT): Rule description to be added to the document
There are two transformation syntaxes (elaborated further in the Implementation chapter):
- RegEx
- C-style string format
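The following sketch shows how a rule row might be applied to a word under each syntax. The exact semantics belong to the Implementation chapter, so everything here is an assumption: the numeric IDs are hypothetical, the RegEx case treats `before` as a pattern and `after` as its replacement, and the C-style case treats `after` as a format string with a single `%s` slot for the word (leaving `before` unused).

```python
import re
from dataclasses import dataclass

# Hypothetical IDs for the transformation_syntax column.
SYNTAX_REGEX = 1
SYNTAX_C_FORMAT = 2


@dataclass
class ProductionRule:
    """In-memory mirror of one production rule row."""
    name: str
    transformation_syntax: int
    before: str
    after: str
    description: str = ""


def apply_rule(rule: ProductionRule, word: str) -> str:
    """Apply one production rule to a word and return the transformed word."""
    if rule.transformation_syntax == SYNTAX_REGEX:
        # Assumption: `before` is a pattern, `after` is the replacement text.
        return re.sub(rule.before, rule.after, word)
    if rule.transformation_syntax == SYNTAX_C_FORMAT:
        # Assumption: `after` holds a printf-style template with one %s slot.
        return rule.after % word
    raise ValueError(f"unknown transformation syntax: {rule.transformation_syntax}")


plural_y = ProductionRule("plural-y", SYNTAX_REGEX, r"y$", "ies")
plural_s = ProductionRule("plural-s", SYNTAX_C_FORMAT, "", "%ss")
print(apply_rule(plural_y, "city"))  # cities
print(apply_rule(plural_s, "cat"))   # cats
```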
Each kind of production rule is represented in its own table and is described in the following subsections.
Inflection
Inflections are usually unique to only one part of speech, so they need a column for the part of speech.
Extra column:
- part_of_speech (INTEGER): ID of the part of speech
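One way the inflection table could look, as a sketch: the shared production rule columns plus `part_of_speech` declared as a foreign key. The table names, the `id` columns, and the constraints are assumptions for illustration. `before` and `after` are SQLite keywords, so they are quoted here to be safe.

```python
import sqlite3

# Hypothetical table names and constraints; columns follow the lists above.
INFLECTION_SCHEMA = """
CREATE TABLE part_of_speech (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE inflection (
    id                    INTEGER PRIMARY KEY,
    name                  TEXT NOT NULL UNIQUE,
    transformation_syntax INTEGER NOT NULL,
    "before"              TEXT NOT NULL,
    "after"               TEXT NOT NULL,
    description           TEXT,
    part_of_speech        INTEGER NOT NULL REFERENCES part_of_speech(id)
);
"""


def open_grammar_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(INFLECTION_SCHEMA)
    return conn


def rules_for(conn: sqlite3.Connection, part_of_speech_name: str) -> list:
    """Return the names of the inflection rules for one part of speech."""
    rows = conn.execute(
        """
        SELECT i.name
        FROM inflection AS i
        JOIN part_of_speech AS p ON p.id = i.part_of_speech
        WHERE p.name = ?
        ORDER BY i.id
        """,
        (part_of_speech_name,),
    )
    return [row[0] for row in rows]
```

With the foreign key in place, inserting an inflection that points at a nonexistent part of speech fails with `sqlite3.IntegrityError` instead of leaving a dangling ID.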
Phrase Syntax
TBD
Derivation
Derivation is like inflection, except that it usually changes the word's part of speech. By default, derivation rules are disabled.
Extra columns:
- part_of_speech_before (INTEGER): ID of the part of speech the rule applies to
- part_of_speech_after (INTEGER): ID of the part of speech the rule transforms the word into
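To illustrate how the two part-of-speech columns work together, here is a small sketch in which a derivation rule applies only to words of `part_of_speech_before` and yields the new word together with `part_of_speech_after`. The `DerivationRule` type, the `derive` helper, and the numeric IDs are hypothetical, and the rule is assumed to use the RegEx syntax.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical part-of-speech IDs.
VERB = 1
NOUN = 2


class DerivationRule(NamedTuple):
    """In-memory mirror of one derivation rule row."""
    name: str
    before: str  # assumed to be a RegEx pattern
    after: str   # assumed to be the replacement text
    part_of_speech_before: int
    part_of_speech_after: int


def derive(
    rule: DerivationRule, word: str, part_of_speech: int
) -> Optional[Tuple[str, int]]:
    """Return (derived word, new part of speech), or None if the rule does not apply."""
    if part_of_speech != rule.part_of_speech_before:
        return None
    return re.sub(rule.before, rule.after, word), rule.part_of_speech_after


agent = DerivationRule("agent-noun", r"$", "er", VERB, NOUN)
print(derive(agent, "teach", VERB))  # ('teacher', 2)
print(derive(agent, "table", NOUN))  # None
```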
Writing system
Orthography rules
Orthography rules are divided into two categories: hard rules and soft rules. Hard rules are enforced by the program, which checks whether a newly added word or a body of text follows them. Soft rules are human-readable rules that will be exported into the document. Hard rules can be defined using RegEx, anti-RegEx (matching text that is disallowed), BNF, or EBNF.
Columns:
- type (TEXT): regex, anti-regex, bnf, ebnf, soft
- rule (TEXT)
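A hard-rule check over rows of this table might look like the sketch below. It only covers the `regex` and `anti-regex` types and rests on two assumptions: a `regex` rule must match the whole word, and an `anti-regex` rule must not match anywhere in it. The `violated_rules` helper is hypothetical; `bnf` and `ebnf` rules would need a grammar parser and are skipped.

```python
import re
from typing import Iterable, List, Tuple


def violated_rules(word: str, rules: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the hard rules that `word` breaks; `rules` holds (type, rule) rows."""
    broken = []
    for rule_type, rule in rules:
        if rule_type == "regex":
            # Assumption: the whole word must match the pattern.
            if re.fullmatch(rule, word) is None:
                broken.append(rule)
        elif rule_type == "anti-regex":
            # Assumption: the pattern must not occur anywhere in the word.
            if re.search(rule, word) is not None:
                broken.append(rule)
        # "soft" rules are documentation only; "bnf" and "ebnf" are not handled here.
    return broken


rules = [
    ("regex", r"[a-z]+"),        # only lowercase Latin letters
    ("anti-regex", r"q(?!u)"),   # "q" must be followed by "u"
    ("soft", "Proper nouns are capitalised."),
]
print(violated_rules("quiet", rules))  # []
print(violated_rules("qat", rules))    # ['q(?!u)']
```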
Scripts
Columns:
- name (TEXT)
- glyph (BLOB): The content of the vector file
Vocabulary
Part of Speech & Word Class
Both tables share the same structure.
Columns:
- name (TEXT)
Word List
Columns:
- word (TEXT)
- part_of_speech (INTEGER): Part of speech ID
- word_class (INTEGER): Word class ID
- definition (TEXT): Can be a translation into a natural language or a definition in the conlang itself -- this is up to the user.
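Putting the vocabulary tables together, a lookup that resolves the two IDs back to names could be sketched as follows. The table names, `id` columns, and constraints are assumptions, as is the choice to make `word_class` optional. The sample word is made up.

```python
import sqlite3

# Hypothetical table names and constraints; columns follow the lists above.
VOCABULARY_SCHEMA = """
CREATE TABLE part_of_speech (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE word_class (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE word_list (
    id             INTEGER PRIMARY KEY,
    word           TEXT NOT NULL,
    part_of_speech INTEGER NOT NULL REFERENCES part_of_speech(id),
    word_class     INTEGER REFERENCES word_class(id),
    definition     TEXT
);
"""


def open_vocabulary_db() -> sqlite3.Connection:
    """Create an in-memory vocabulary database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(VOCABULARY_SCHEMA)
    return conn


def lookup(conn: sqlite3.Connection, word: str) -> list:
    """Return (part of speech, word class, definition) for every entry of `word`."""
    rows = conn.execute(
        """
        SELECT p.name, c.name, w.definition
        FROM word_list AS w
        JOIN part_of_speech AS p ON p.id = w.part_of_speech
        LEFT JOIN word_class AS c ON c.id = w.word_class
        WHERE w.word = ?
        ORDER BY w.id
        """,
        (word,),
    )
    return rows.fetchall()


conn = open_vocabulary_db()
conn.execute("INSERT INTO part_of_speech (id, name) VALUES (1, 'noun')")
conn.execute("INSERT INTO word_class (id, name) VALUES (1, 'nouns')")
conn.execute(
    "INSERT INTO word_list (word, part_of_speech, word_class, definition) "
    "VALUES ('kala', 1, 1, 'fish')"
)
print(lookup(conn, "kala"))  # [('noun', 'nouns', 'fish')]
```

The `LEFT JOIN` keeps words that have no word class assigned; their class simply comes back as `None`.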
[1] Kenneth H. Rosen, Discrete Mathematics and Its Applications