Continue writing about database and start on implementation design
parent bf41d6f9a4
commit 306360366e
56
db.md
@@ -10,7 +10,7 @@ any modern programming language.
On top of that, it is natively supported on both Android and iOS.
It is also suitable for a custom [application file format][app-format]. In fact,
Anki used SQLite for its flashcard deck file.
SQLite supports databases upto [281 terabytes][size], which is more than enough
SQLite supports databases up to [281 terabytes][size], which is more than enough
for the use case.

[app-format]: https://sqlite.org/appfileformat.html

@@ -22,24 +22,50 @@ for the use case.
The database for this would likely be mostly fixed in number of rows.

- IPA (`TEXT`): The International Phonetic Alphabet representation of the sound.
- X-SAMPA (`TEXT`): The X-SAMPA equivalent, which allows user to type on a non-IPA keyboard
- X_SAMPA (`TEXT`): The X-SAMPA equivalent, which allows users to type on a non-IPA keyboard
- is_used (`BOOLEAN`): Whether the sound is used in the language
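
The column list above could be sketched as follows, using Python's built-in `sqlite3` module. The table name `phoneme`, the `NOT NULL`/`UNIQUE`/`CHECK` constraints, and the sample rows are assumptions made for illustration; only the three column names and their types come from the design.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")

# "phoneme" is a hypothetical table name; the constraints are assumptions.
conn.execute("""
    CREATE TABLE phoneme (
        ipa      TEXT NOT NULL UNIQUE,
        x_sampa  TEXT NOT NULL,
        is_used  BOOLEAN NOT NULL DEFAULT 0 CHECK (is_used IN (0, 1))
    )
""")

# SQLite has no separate boolean storage class; 0 and 1 are stored as integers.
conn.executemany(
    "INSERT INTO phoneme (ipa, x_sampa, is_used) VALUES (?, ?, ?)",
    [("ʃ", "S", 1), ("θ", "T", 0), ("ŋ", "N", 1)],
)

# List the sounds that the language actually uses, ordered by their X-SAMPA form.
used = [
    row[0]
    for row in conn.execute(
        "SELECT ipa FROM phoneme WHERE is_used = 1 ORDER BY x_sampa"
    )
]
print(used)
```

Because the set of sounds is mostly fixed, toggling `is_used` is probably the only routine write this table would see.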

## Grammar
### Inflection
### Syntax
## Morphology
### Affixes
### Part of Speech

According to Chomsky, a grammar consists of:

- a vocabulary V
- a subset T of V consisting of terminal symbols -- we call this the word list
- a subset N of V consisting of non-terminal symbols -- we call these word classes
  (e.g. nouns, adjectives, noun phrases)
- a start symbol S -- this is usually a phrase or sentence
- a finite set of productions P [^1]
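
As a rough illustration of the definition above, the components can be held in a small Python structure. The class name `Grammar`, its field names, and the toy English fragment are hypothetical; the well-formedness check assumes the usual textbook convention that T and N together make up V and do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    """Hypothetical container for the components listed above."""

    vocabulary: frozenset    # V
    terminals: frozenset     # T, the word list
    nonterminals: frozenset  # N, the word classes
    start: str               # S
    productions: tuple       # P, as (left-hand side, right-hand side) pairs

    def is_well_formed(self):
        # Assumption: T and N partition V, and S is a symbol of V.
        return (
            self.terminals | self.nonterminals == self.vocabulary
            and self.terminals.isdisjoint(self.nonterminals)
            and self.start in self.vocabulary
        )


# Toy fragment, invented for this example.
toy = Grammar(
    vocabulary=frozenset({"S", "NP", "VP", "the", "cat", "sleeps"}),
    terminals=frozenset({"the", "cat", "sleeps"}),
    nonterminals=frozenset({"S", "NP", "VP"}),
    start="S",
    productions=(
        ("S", ("NP", "VP")),
        ("NP", ("the", "cat")),
        ("VP", ("sleeps",)),
    ),
)
print(toy.is_well_formed())
```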

In this project, grammar only means the production rules.

The table for grammar rules can thus be defined by five columns:

- name (`TEXT`, unique): The human-friendly identifier for the rule
- production_type (`INTEGER`): The id of the production type
- transformation_syntax (`INTEGER`): The id of the transformation syntax
- before (`TEXT`): The string to be transformed
- after (`TEXT`): The string after the transformation

To represent the production type and transformation syntax, we need two other read-only
tables.

There are three production types:

- inflection: for example, how a verb conjugates or how a noun declines
- phrase syntax: the syntax for a type of phrase or sentence
- derivational: how a word can transform to another word

There are two transformation syntaxes (described in more detail in the
[Implementation](implementation.md) section):

- RegEx
- C-style string format
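
A minimal sketch of how the rule table and its two lookup tables might fit together, again with Python's `sqlite3` module. The table names (`grammar_rule`, `production_type`, `transformation_syntax`), the numeric ids, the foreign-key constraints, and the sample rule are assumptions for illustration; the column names and the lookup values are taken from the lists above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign-key enforcement is off by default in SQLite and is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

# "before" and "after" are quoted because they are also SQL keywords.
conn.executescript("""
    CREATE TABLE production_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE transformation_syntax (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE grammar_rule (
        name                  TEXT NOT NULL UNIQUE,
        production_type       INTEGER NOT NULL REFERENCES production_type (id),
        transformation_syntax INTEGER NOT NULL REFERENCES transformation_syntax (id),
        "before"              TEXT NOT NULL,
        "after"               TEXT NOT NULL
    );
    INSERT INTO production_type (id, name) VALUES
        (1, 'inflection'), (2, 'phrase syntax'), (3, 'derivational');
    INSERT INTO transformation_syntax (id, name) VALUES
        (1, 'RegEx'), (2, 'C-style string format');
""")

# Hypothetical sample rule: an inflection written in the RegEx syntax.
conn.execute(
    "INSERT INTO grammar_rule "
    '(name, production_type, transformation_syntax, "before", "after") '
    "VALUES (?, ?, ?, ?, ?)",
    ("plural umlaut", 1, 1, "oo", "ee"),
)

# Resolve the two ids back to their human-readable names.
row = conn.execute("""
    SELECT r.name, p.name, s.name
    FROM grammar_rule AS r
    JOIN production_type AS p ON p.id = r.production_type
    JOIN transformation_syntax AS s ON s.id = r.transformation_syntax
""").fetchone()
print(row)
```

Storing ids instead of repeating the type names keeps the rule rows small and lets the application treat the two lookup tables as fixed reference data.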

## Writing system
### Orthography rules
### Scripts
## Vocabulary
## Others
### Part of Speech
### Word List

Not all data can be represented as structured data.
For example:

- Image
- Audio

These kinds of data are stored as file blobs in a dedicated SQLite table.
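
One possible shape for such a blob table is sketched below. The table name `file_blob`, the `filename` and `media_type` columns, and the placeholder bytes are assumptions for illustration; the design only states that the files live as blobs in a dedicated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: one row per stored file.
conn.execute("""
    CREATE TABLE file_blob (
        id         INTEGER PRIMARY KEY,
        filename   TEXT NOT NULL,
        media_type TEXT NOT NULL,
        content    BLOB NOT NULL
    )
""")

# Placeholder bytes standing in for real file contents.
payload = bytes([0x89, 0x50, 0x4E, 0x47])

# Python bytes objects are bound as BLOB values.
cur = conn.execute(
    "INSERT INTO file_blob (filename, media_type, content) VALUES (?, ?, ?)",
    ("glyph.png", "image/png", payload),
)

# Reading the column back yields a bytes object again.
stored = conn.execute(
    "SELECT content FROM file_blob WHERE id = ?", (cur.lastrowid,)
).fetchone()[0]
print(stored == payload)
```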
[^1]: Kenneth H. Rosen, Discrete Mathematics and Its Applications

@@ -0,0 +1,38 @@
# Implementation

## Grammar

Grammar is a very complex issue in linguistics, and it is certainly hard
to represent it structurally. This design thus likely does not cover
all grammatical constructions. It might be rather Eurocentric,
and probably does not cover many languages whose grammar I'm not familiar with, such as:

- Korean
- Arabic (all dialects)
- Swahili
- Nahuatl
- Lojban
- Sign languages

I would be happy to extend the system (either myself or by merging contributions)
to be able to represent those languages once the project is stable enough.

### Inflection

Inflections, at least in the majority of Indo-European languages,
occur as prefixes or suffixes. We should not exclude the possibility of other types of inflection:

- Circumfix: haben -- **ge**hab**t** (German)
- Simulfix: goose -- geese (English, also known as umlaut or ablaut)
- German Trennbarverb: einschlafen -- Ich schlafe ein.
- Infix: No example found yet
- Reduplication

I propose two formats to store inflection rules:

- C-style string format, e.g. `%Sen` signifies that the stem is followed by *en*.
  - Example: the transformation `%Sen` --> `%St` would turn *haben* into *habt* and *liegen* into *liegt*.
    It also turns *senden* into **sendt*.
- RegEx, e.g. `oo` matches the first substring *oo* and transforms it.
  - Example: the transformation `oo` --> `ee` would turn *foot* into *feet* and *tooth* into *teeth*.
    It also turns *book* into **beek*.
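
The two formats could be applied roughly as in the sketch below. The function names `apply_format_rule` and `apply_regex_rule` are hypothetical, and the sketch assumes that `%S` appears exactly once in the `before` pattern and stands for a non-empty stem; neither detail is fixed by the proposal.

```python
import re


def apply_format_rule(word, before, after):
    """Apply a C-style rule such as '%Sen' --> '%St'.

    Returns None when the word does not fit the 'before' pattern.
    """
    prefix, marker, suffix = before.partition("%S")
    if not marker:
        raise ValueError("the 'before' pattern must contain %S")
    # Treat the text around %S literally and capture the stem in between.
    pattern = re.escape(prefix) + "(.+)" + re.escape(suffix)
    match = re.fullmatch(pattern, word)
    if match is None:
        return None
    return after.replace("%S", match.group(1))


def apply_regex_rule(word, before, after):
    """Replace only the first match of the regular expression 'before'."""
    return re.sub(before, after, word, count=1)


print(apply_format_rule("haben", "%Sen", "%St"))    # habt
print(apply_format_rule("senden", "%Sen", "%St"))   # sendt (overgenerated form)
print(apply_format_rule("haben", "%Sen", "ge%St"))  # gehabt (circumfix)
print(apply_regex_rule("foot", "oo", "ee"))         # feet
print(apply_regex_rule("book", "oo", "ee"))         # beek (overgenerated form)
```

Neither function knows which words a rule is allowed to apply to, which is why the overgenerated forms still come out; restricting a rule to particular words would need extra information beyond the `before` and `after` strings.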