From 306360366e8ba5fd001d5454ee453d67f9d1c995 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ng=C3=B4=20Ng=E1=BB=8Dc=20=C4=90=E1=BB=A9c=20Huy?=
Date: Sat, 6 Mar 2021 20:02:56 +0700
Subject: [PATCH] Continue writing about database and start on implementation design

---
 db.md             | 56 ++++++++++++++++++++++++++++++++++-------------
 implementation.md | 38 ++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 15 deletions(-)
 create mode 100644 implementation.md

diff --git a/db.md b/db.md
index 4643970..efb0bea 100644
--- a/db.md
+++ b/db.md
@@ -10,7 +10,7 @@ any modern programming language.
 On top of that, it is natively supported on both Android and iOS. It is also
 suitable for a custom [application file format][app-format]. In fact, Anki used
 SQLite for its flashcard deck file.
-SQLite supports databases upto [281 terabytes][size], which is more than enough
+SQLite supports databases up to [281 terabytes][size], which is more than enough
 for the use case.
 
 [app-format]: https://sqlite.org/appfileformat.html
@@ -22,24 +22,50 @@ for the use case.
 The database for this would likely be mostly fixed in number of rows.
 
 - IPA (`TEXT`): The International Phonetic Alphabet representation of the sound.
-- X-SAMPA (`TEXT`): The X-SAMPA equivalent, which allows user to type on a non-IPA keyboard
+- X_SAMPA (`TEXT`): The X-SAMPA equivalent, which allows users to type on a non-IPA keyboard
 - is_used (`BOOLEAN`): Whether the sound is used in the language
 
 ## Grammar
-### Inflection
-### Syntax
-## Morphology
-### Affixes
-### Part of Speech
+
+According to Chomsky, a grammar consists of:
+
+- a vocabulary V
+  - a subset T of V consisting of terminal symbols -- we call this the word list
+  - a subset N of V consisting of non-terminal symbols -- we call these word classes
+    (e.g. nouns, adjectives, noun phrases)
+- a start symbol S -- this is usually a phrase or sentence
+- a finite set of productions P [^1]
+
+In this project, grammar only means the production rules.
+
+The table for grammar rules can thus be defined by five columns:
+
+- name (`TEXT`, unique): The human-friendly identifier for the rule
+- production_type (`INTEGER`): The id of the production type
+- transformation_syntax (`INTEGER`): The id of the transformation syntax
+- before (`TEXT`): The string to be transformed
+- after (`TEXT`): The string after the transformation
+
+To represent the production type and transformation syntax, we need two other read-only
+tables.
+
+There are three production types:
+
+- inflection: for example, how a verb conjugates or how a noun declines
+- phrase syntax: the syntax for a type of phrase or sentence
+- derivational: how a word can be transformed into another word
+
+There are two transformation syntaxes (elaborated in the
+[Implementation](implementation.md) section):
+
+- RegEx
+- C-style string format
+
 ## Writing system
+### Orthography rules
 ### Scripts
 ## Vocabulary
-## Others
+### Part of Speech
+### Word List
 
-Not all data can be represented as structured data.
-For example:
-
-- Image
-- Audio
-
-These kinds of data are stored as file blobs in a dedicated SQLite table.
+[^1]: Kenneth H. Rosen, Discrete Mathematics and Its Applications
diff --git a/implementation.md b/implementation.md
new file mode 100644
index 0000000..6746ef8
--- /dev/null
+++ b/implementation.md
@@ -0,0 +1,38 @@
+# Implementation
+
+## Grammar
+
+Grammar is a very complex issue in linguistics, and it is certainly hard
+to represent structurally. This design thus likely does not cover
+all grammatical constructions. It might be rather Eurocentric,
+and probably does not cover many languages whose grammar I'm not familiar with, such as:
+
+- Korean
+- Arabic (all dialects)
+- Swahili
+- Nahuatl
+- Lojban
+- Sign languages
+
+I would be happy to extend the system (either by myself or by merging contributions)
+to be able to represent those languages once the project is stable enough.
+
+### Inflection
+
+Inflections, at least in the majority of Indo-European languages,
+occur as prefixes or suffixes. However, we should not exclude the possibility of other types of inflection:
+
+- Circumfix: haben -- **ge**hab**t** (German)
+- Simulfix: goose -- geese (English, also known as umlaut or ablaut)
+- German Trennbarverb: einschlafen -- Ich schlafe ein.
+- Infix: No example found yet
+- Reduplication
+
+I propose two formats to store inflection rules:
+
+- C-style string format, e.g. `%Sen` signifies that the stem is followed by *en*.
+  - Example: the transformation `%Sen` --> `%St` would turn *haben* into *habt* and *liegen* into *liegt*.
+    It also turns *senden* into \**sendt*.
+- RegEx, e.g. `oo` matches the first substring `oo` and transforms it.
+  - Example: the transformation `oo` --> `ee` would turn *foot* into *feet* and *tooth* into *teeth*.
+    It also turns *book* into \**beek*.
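
Note on the two proposed rule formats (not part of the patch): below is a minimal Python sketch of how such rules might be applied. The function names `apply_format_rule` and `apply_regex_rule` are hypothetical, and the sketch assumes that `%S` appears exactly once in the `before` template and stands for a non-empty stem; the real syntax may well end up different.

```python
import re


def apply_format_rule(word, before, after):
    """Apply a C-style format rule, e.g. before="%Sen", after="%St".

    Assumption: "%S" marks the stem and occurs exactly once in `before`.
    Returns the transformed word, or None if `word` does not fit `before`.
    """
    prefix, marker, suffix = before.partition("%S")
    if not marker or "%S" in suffix:
        raise ValueError("`before` must contain %S exactly once")
    # Literal parts are escaped; %S becomes a capture group for the stem.
    pattern = re.escape(prefix) + "(.+)" + re.escape(suffix)
    match = re.fullmatch(pattern, word)
    if match is None:
        return None
    return after.replace("%S", match.group(1))


def apply_regex_rule(word, before, after):
    """Replace only the first match of the regex `before` with `after`.

    Returns None if `before` does not match anywhere in `word`.
    """
    result, count = re.subn(before, after, word, count=1)
    return result if count else None


print(apply_format_rule("haben", "%Sen", "%St"))    # habt
print(apply_format_rule("senden", "%Sen", "%St"))   # sendt (overgenerated form)
print(apply_format_rule("haben", "%Sen", "ge%St"))  # gehabt
print(apply_regex_rule("foot", "oo", "ee"))         # feet
print(apply_regex_rule("book", "oo", "ee"))         # beek (overgenerated form)
```

Under these assumptions, the format syntax can also express the circumfix case listed above (`%Sen` --> `ge%St`). Both functions reproduce the overgeneration noted in the examples (\**sendt*, \**beek*), so a rule table would probably still need some way to record exceptions.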