From 306360366e8ba5fd001d5454ee453d67f9d1c995 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ng=C3=B4=20Ng=E1=BB=8Dc=20=C4=90=E1=BB=A9c=20Huy?=
 <huyngo@disroot.org>
Date: Sat, 6 Mar 2021 20:02:56 +0700
Subject: [PATCH] Continue writing about database and start on implementation
 design

---
 db.md             | 56 ++++++++++++++++++++++++++++++++++-------------
 implementation.md | 38 ++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 15 deletions(-)
 create mode 100644 implementation.md

diff --git a/db.md b/db.md
index 4643970..efb0bea 100644
--- a/db.md
+++ b/db.md
@@ -10,7 +10,7 @@ any modern programming language.
 On top of that, it is natively supported on both Android and iOS.
 It is also suitable for a custom [application file format][app-format]. In fact,
 Anki used SQLite for its flashcard deck file.
-SQLite supports databases upto [281 terabytes][size], which is more than enough
+SQLite supports databases up to [281 terabytes][size], which is more than enough
 for the use case.
 
 [app-format]: https://sqlite.org/appfileformat.html
@@ -22,24 +22,50 @@ for the use case.
 The database for this would likely be mostly fixed in number of rows.
 
 - IPA (`TEXT`): The International Phonetic Alphabet representation of the sound.
-- X-SAMPA (`TEXT`): The X-SAMPA equivalent, which allows user to type on a non-IPA keyboard
+- X_SAMPA (`TEXT`): The X-SAMPA equivalent, which allows users to type on a non-IPA keyboard
 - is_used (`BOOLEAN`): Whether the sound is used in the language
 
 ## Grammar
-### Inflection
-### Syntax
-## Morphology
-### Affixes
-### Part of Speech
+
+According to Chomsky, a grammar consists of:
+
+- vocabulary V
+	- a subset T of V consisting of terminal symbols -- we call this the word list
+	- a subset N of V consisting of non-terminal symbols -- we call these word classes
+		(e.g. nouns, adjectives, noun phrases)
+- a start symbol S -- this is usually a phrase or sentence
+- a finite set of productions P [^1]
+
+In this project, grammar only means the production rules.
+
+The table for grammar rules can thus be defined by five columns:
+
+- name (`TEXT`, unique): The human-friendly identifier for the rule
+- production_type (`INTEGER`): The id of the production type
+- transformation_syntax (`INTEGER`): The id of the transformation syntax
+- before (`TEXT`): The string to be transformed
+- after (`TEXT`): The string after the transformation
+
+To represent the production type and transformation syntax, we need two other read-only
+tables.
+
+There are three production types:
+
+- inflection: for example, how a verb conjugates or how a noun declines
+- phrase syntax: the syntax for a type of phrase or sentence
+- derivation: how a word can be transformed into another word
+
+There are two transformation syntaxes (elaborated in the
+[Implementation](implementation.md) section):
+
+- RegEx
+- C-style string format
+
 ## Writing system
+### Orthography rules
 ### Scripts
 ## Vocabulary
-## Others
+### Part of Speech
+### Word List
 
-Not all data can be represented as structured data.
-For example:
-
-- Image
-- Audio
-
-These kinds of data are stored as file blobs in a dedicated SQLite table.
+[^1]: Kenneth H. Rosen, Discrete Mathematics and Its Applications
diff --git a/implementation.md b/implementation.md
new file mode 100644
index 0000000..6746ef8
--- /dev/null
+++ b/implementation.md
@@ -0,0 +1,38 @@
+# Implementation
+
+## Grammar
+
+Grammar is a very complex issue in linguistics, and it is certainly hard
+to represent structurally. This design thus likely does not cover
+all grammatical constructions. It might be rather Eurocentric,
+and probably does not cover many languages whose grammar I'm not familiar with, such as:
+
+- Korean
+- Arabic (all dialects)
+- Swahili
+- Nahuatl
+- Lojban
+- Sign languages
+
+I would be happy to extend the system (either by myself or by merging contributions)
+to be able to represent those languages once the project is stable enough.
+
+### Inflection
+
+Inflections, at least in the majority of Indo-European languages,
+occur as prefixes or suffixes. We should not exclude the possibility of other types of inflection:
+
+- Circumfix: haben -- **ge**hab**t** (German)
+- Simulfix: goose -- geese (English, also known as umlaut or ablaut)
+- Separable verb (German): einschlafen -- Ich schlafe ein.
+- Infix: No example found yet
+- Reduplication
+
+I propose two formats to store inflection rules:
+
+- C-style string format, e.g. `%Sen` signifies that the stem is followed by *en*.
+	- Example: the transformation `%Sen` --> `%St` would turn *haben* into *habt* and *liegen* into *liegt*.
+		It also turns *senden* into **sendt*.
+- RegEx, e.g. `oo` matches the first occurrence of the substring *oo*, which is then replaced.
+	- Example: the transformation `oo` --> `ee` would turn *foot* into *feet* and *tooth* into *teeth*.
+		It also turns *book* into **beek*.