Continue writing about database and start on implementation design

Ngô Ngọc Đức Huy 2021-03-06 20:02:56 +07:00
parent bf41d6f9a4
commit 306360366e
Signed by: huyngo
GPG Key ID: 904AF1C7CDF695C3
2 changed files with 79 additions and 15 deletions

56
db.md

@ -10,7 +10,7 @@ any modern programming language.
On top of that, it is natively supported on both Android and iOS.
It is also suitable for a custom [application file format][app-format]. In fact,
Anki uses SQLite for its flashcard deck files.
SQLite supports databases upto [281 terabytes][size], which is more than enough
SQLite supports databases up to [281 terabytes][size], which is more than enough
for the use case.
[app-format]: https://sqlite.org/appfileformat.html
@ -22,24 +22,50 @@ for the use case.
The database for this would likely have a mostly fixed number of rows.
- IPA (`TEXT`): The International Phonetic Alphabet representation of the sound.
- X-SAMPA (`TEXT`): The X-SAMPA equivalent, which allows user to type on a non-IPA keyboard
- X_SAMPA (`TEXT`): The X-SAMPA equivalent, which allows users to type the sound on a non-IPA keyboard
- is_used (`BOOLEAN`): Whether the sound is used in the language
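As a minimal sketch, the three fields above could be declared with Python's built-in `sqlite3` module as follows. The table name `phoneme`, the lowercase column names, and the choice of `ipa` as primary key are assumptions made for illustration, not part of the design.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")

# Hypothetical table name and key choice; only the three fields come from the text.
conn.execute(
    """
    CREATE TABLE phoneme (
        ipa     TEXT PRIMARY KEY,
        x_sampa TEXT NOT NULL,
        is_used BOOLEAN NOT NULL DEFAULT 0 CHECK (is_used IN (0, 1))
    )
    """
)

# A few sample rows: (IPA symbol, X-SAMPA equivalent, used in the language?)
conn.executemany(
    "INSERT INTO phoneme (ipa, x_sampa, is_used) VALUES (?, ?, ?)",
    [("ʃ", "S", 1), ("ŋ", "N", 1), ("θ", "T", 0)],
)

# List the X-SAMPA spellings of the sounds the language actually uses.
used = [
    row[0]
    for row in conn.execute(
        "SELECT x_sampa FROM phoneme WHERE is_used = 1 ORDER BY x_sampa"
    )
]
print(used)
```

SQLite has no separate boolean storage class, so `is_used` is stored as the integer 0 or 1; the `CHECK` constraint only makes that explicit.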
## Grammar
### Inflection
### Syntax
## Morphology
### Affixes
### Part of Speech
According to Chomsky, a grammar consists of:
- a vocabulary V
- a subset T of V consisting of terminal symbols -- we call this the word list
- a subset N of V consisting of non-terminal symbols -- we call these word classes
(e.g. nouns, adjectives, noun phrases)
- a start symbol S -- this is usually a phrase or sentence
- a finite set of productions P [^1]
In this project, grammar only means the production rules.
The table for grammar rules can thus be defined by five columns:
- name (`TEXT`, unique): The human-friendly identifier for the rule
- production_type (`INTEGER`): id for the production type
- transformation_syntax (`INTEGER`): id for the transformation syntax
- before (`TEXT`): The string to be transformed
- after (`TEXT`): The string after the transformation
To represent the production type and transformation syntax, we need two other read-only
lookup tables.
There are three production types:
- inflection: for example, how a verb conjugates or how a noun declines
- phrase syntax: the syntax for a type of phrase or sentence
- derivational: how a word can transform to another word
There are two transformation syntaxes (elaborated further in the
[Implementation](implementation.md) section):
- RegEx
- C-style string format
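The rule table and its two lookup tables could be sketched with Python's built-in `sqlite3` module as shown here. The table names (`grammar_rule`, `production_type`, `transformation_syntax`), the numeric ids, and the sample rule are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Hypothetical table names and ids; column names follow the list above.
# "before" and "after" are SQL keywords, so they are quoted to be safe.
conn.executescript(
    """
    CREATE TABLE production_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE transformation_syntax (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE grammar_rule (
        name                  TEXT NOT NULL UNIQUE,
        production_type       INTEGER NOT NULL REFERENCES production_type (id),
        transformation_syntax INTEGER NOT NULL REFERENCES transformation_syntax (id),
        "before"              TEXT NOT NULL,
        "after"               TEXT NOT NULL
    );
    INSERT INTO production_type (id, name) VALUES
        (1, 'inflection'), (2, 'phrase syntax'), (3, 'derivational');
    INSERT INTO transformation_syntax (id, name) VALUES
        (1, 'RegEx'), (2, 'C-style string format');
    """
)

# A sample inflection rule written in the RegEx syntax.
conn.execute(
    """
    INSERT INTO grammar_rule
        (name, production_type, transformation_syntax, "before", "after")
    VALUES (?, ?, ?, ?, ?)
    """,
    ("plural oo -> ee", 1, 1, "oo", "ee"),
)

# Resolve the two ids back to their human-readable names.
row = conn.execute(
    """
    SELECT r.name, p.name, s.name
    FROM grammar_rule AS r
    JOIN production_type AS p ON p.id = r.production_type
    JOIN transformation_syntax AS s ON s.id = r.transformation_syntax
    """
).fetchone()
print(row)
```

Keeping the production types and syntaxes in their own tables means the application can list the valid choices with a plain `SELECT` instead of hard-coding them.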
## Writing system
### Orthography rules
### Scripts
## Vocabulary
## Others
### Part of Speech
### Word List
Not all data can be represented as structured data.
For example:
- Image
- Audio
These kinds of data are stored as file blobs in a dedicated SQLite table.
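A minimal sketch of such a blob table, using Python's built-in `sqlite3` module. The table name `media`, its columns, and the sample file name are hypothetical; the text only states that files are stored as blobs in a dedicated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: one row per stored file.
conn.execute(
    """
    CREATE TABLE media (
        id         INTEGER PRIMARY KEY,
        file_name  TEXT NOT NULL,
        media_type TEXT NOT NULL,
        content    BLOB NOT NULL
    )
    """
)

# Stand-in bytes; a real application would read them from an image or audio file.
payload = bytes([0x52, 0x49, 0x46, 0x46])

conn.execute(
    "INSERT INTO media (file_name, media_type, content) VALUES (?, ?, ?)",
    ("sample.wav", "audio/wav", payload),
)

# Python bytes objects round-trip through a BLOB column unchanged.
stored = conn.execute(
    "SELECT content FROM media WHERE file_name = ?", ("sample.wav",)
).fetchone()[0]
print(stored == payload)
```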
[^1]: Kenneth H. Rosen, Discrete Mathematics and Its Applications

38
implementation.md Normal file

@ -0,0 +1,38 @@
# Implementation
## Grammar
Grammar is a very complex topic in linguistics, and it is certainly hard
to represent structurally. This design thus likely does not cover
all grammatical constructions. It might be rather Eurocentric,
and it probably does not cover many languages whose grammar I'm not familiar with, such as:
- Korean
- Arabic (all dialects)
- Swahili
- Nahuatl
- Lojban
- Sign languages
I would be happy to extend the system (either myself or by merging contributions)
so that it can represent those languages once the project is stable enough.
### Inflection
Inflections, at least in the majority of Indo-European languages,
occur as prefixes or suffixes. However, we should not exclude the possibility of other types of inflection:
- Circumfix: haben -- **ge**hab**t** (German)
- Simulfix: goose -- geese (English, also known as umlaut or ablaut)
- Separable verb (German *trennbares Verb*): einschlafen -- Ich schlafe ein.
- Infix: No example found yet
- Reduplication
I propose two formats to store inflection rules:
- C-style string format, e.g. `%Sen` signifies that the stem is followed by *en*.
- Example: the transformation `%Sen` --> `%St` would turn *haben* into *habt* and *liegen* into *liegt*.
It would also turn *senden* into the incorrect form **sendt*.
- RegEx, e.g. `oo` matches the first substring *oo* in the word, which is then transformed.
- Example: the transformation `oo` --> `ee` would turn *foot* into *feet* and *tooth* into *teeth*.
It would also turn *book* into the incorrect form **beek*.
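The two formats could be applied roughly as in the following Python sketch. The function names `apply_format_rule` and `apply_regex_rule` are hypothetical, and treating `%S` as "one or more characters of stem" is an assumption about how the placeholder is meant to behave.

```python
import re

STEM = "%S"  # stem placeholder used by the C-style format


def apply_format_rule(word, before, after):
    """Apply a C-style rule such as '%Sen' --> '%St'.

    Returns the transformed word, or None if the word does not fit `before`.
    """
    prefix, placeholder, suffix = before.partition(STEM)
    if not placeholder:
        raise ValueError("the 'before' pattern must contain %S")
    # Literal text around %S is escaped; the stem is captured as a group.
    pattern = re.escape(prefix) + "(.+)" + re.escape(suffix)
    match = re.fullmatch(pattern, word)
    if match is None:
        return None
    return after.replace(STEM, match.group(1))


def apply_regex_rule(word, before, after):
    """Apply a RegEx rule, replacing only the first match of `before`."""
    return re.sub(before, after, word, count=1)


print(apply_format_rule("haben", "%Sen", "%St"))   # habt
print(apply_format_rule("senden", "%Sen", "%St"))  # sendt (overgenerated)
print(apply_regex_rule("tooth", "oo", "ee"))       # teeth
print(apply_regex_rule("book", "oo", "ee"))        # beek (overgenerated)
```

Because `%S` may also appear in the middle of the `after` string, the same function covers the circumfix case: `apply_format_rule("haben", "%Sen", "ge%St")` yields *gehabt*. Neither function knows which words a rule is valid for, so exceptions such as *senden* and *book* would still need to be handled separately.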