added --newline flag and permissive encoding rule for reading files in mkv-this-dir

2020-04-20 12:21:18 -03:00 · d15f5f60d3
parent d58b257850
commit d15f5f60d3
4 changed files with 41 additions and 20 deletions
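
a note on the `--newline` flag added in this commit: with `action='store_true'` the value is `False` unless the flag is passed, and the scripts use it to pick `markovify.NewlineText` over `markovify.Text`. the sketch below shows the same switch without needing `markovify` installed. `build_parser` and `split_sentences` are hypothetical names, and the plain string splitting is only a rough stand-in for what the two model classes do, not their actual implementation.

```python
import argparse


def build_parser():
    """Parser with only the new flag; the real scripts define more options."""
    parser = argparse.ArgumentParser()
    # store_true: args.newline is False by default, True when --newline is given.
    parser.add_argument('--newline', action='store_true',
                        help="sentences in input file end with newlines "
                             "rather than with full stops.")
    return parser


def split_sentences(text, newline=False):
    """Hypothetical stand-in for choosing NewlineText vs Text.

    newline=True  -> treat every non-empty line as one sentence.
    newline=False -> naive split on full stops (much cruder than markovify).
    """
    parts = text.splitlines() if newline else text.split('.')
    return [p.strip() for p in parts if p.strip()]


if __name__ == '__main__':
    args = build_parser().parse_args()
    sample = "roses are red\nviolets are blue\n"
    print(split_sentences(sample, newline=args.newline))
```

run it with and without `--newline` to see how the same input is cut into different "sentences".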
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ it simply makes some of the features of the `markovify` python module available

 it was written by a total novice, so you probably shouldn’t download it. i only learned about `argparser` yesterday, and pypi.org today, no matter what day it is. tomorrow i might learn about `os` and `sys`.

-#### installing:
+### installing:

 install it with `pip`, the python package manager:

@@ -25,17 +25,17 @@ to do this you need `python3` and `pip`. if you don't have them, install them th

 `markovify` is also a dependency, but it should install along with `mkv-this`.

-#### macos
+### macos

 it seems to run on macos too.

-you may already have python installed. if not, you first need to install [https://brew.sh/#install](homebrew), edit your PATH so that it works, then install `python3` with `brew install python3`. if you are already running an old version of `homebrew` you might need to run `brew install python3 && brew postinstall python3` to get `python3` and `pip` running right.
+you may already have python installed. if not, you first need to install [homebrew](https://brew.sh/#install), edit your PATH so that it works, then install `python3` with `brew install python3`. if you are already running an old version of `homebrew` you might need to run `brew install python3 && brew postinstall python3` to get `python3` and `pip` running right.

-either way, check if `pip` is installed with `pip --version`, or `pip3 --version`.
+you can check if `pip` is installed with `pip --version`, or `pip3 --version`.

 i know nothing about macs so if you ask me for help i'll just send you random copypasta from the interwebs.

-#### options:
+### options:

 the script implements a few of the basic `markovify` options, so you can:

@@ -44,12 +44,13 @@ the script implements a few of the basic `markovify` options, so you can:
 * specify how many sentences to output (default = 5)
 * specify state size, i.e. the number of preceeding words to be used in calculating the probability of the next word (default = 2).
 * specify the amount of (verbatim) overlap allowed between your input text and your output text.
+* specify that your text's sentences end with newlines rather than full-stops.
 * specify an additional file to use for text input. you can add only one. if you want to feed a stack of files into your bank, use `mkv-this-dir`.
 * if a second file is added, you can also specify the relative weight to give to the two files.

 run `mkv-this -h` to see how to use these options.

-#### mkv-this-dir: markovify a directory of text files
+### mkv-this-dir: markovify a directory of text files

 `mkv-this` can only take two files as input material each time. if you want to input a stack of files, use `mkv-this-dir`. it allows you to specify a directory and all text files in it will be used as input material.

@@ -59,12 +60,13 @@ if for some reason you want to get a similar funtionality with `mkv-this`, you c
 * cd into the directory
 * run `cat * > outputfile.txt`
 * run mkv-this on your newly created file: `mkv-this outputfile.txt`
+* this approach has the benefit of creating a file with encoding that mkv-this can certainly handle.

-#### file types accepted:
+### file types accepted:

 you need to input plain text files. currently accepted file extensions are `.txt`, `.org` and `.md`. it is trivial to add others, so if you want one included just ask.

-#### for best results:
+### for best results:

 feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (URLs, metadata, stars, bullets, etc.), unless you want mashed versions of those things in your output.

--- a/mkv_this/mkv_this.py
+++ b/mkv_this/mkv_this.py
@@ -52,6 +52,7 @@ def main():
    parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5. must be an integer.", type=int, default=5)
    parser.add_argument('-l', '--length', help="set maximum number of characters per sentence. must be an integer.", type=int)
    parser.add_argument('-o', '--overlap', help="the amount of overlap allowed between original text and the output, expressed as a ratio between 0 and 1. defaults to 0.5", type=float, default=0.5)
+    parser.add_argument('--newline', help="sentences in input file end with newlines rather than with full stops.", action='store_true') # store_true means default to False, and becomes True if flagged.
    parser.add_argument('-c', '--combine', help="provide an another input text file with path to be combined with the first.")
    parser.add_argument('-w', '--weight', help="specify the weight to be given to the second text provided with --combine. defaults to 1, and the weight of the initial text is also 1. setting this to 1.5 will place 50 percent more weight on the second text, while setting it to 0.5 will place less.", type=float, default=1)

@@ -66,9 +67,14 @@ def main():
        with open(args.combine, encoding="latin-1") as cf:
            ctext = cf.read()

-        #build the models and build a combined model:
-        text_model = markovify.Text(text, state_size=args.statesize)
-        ctext_model = markovify.Text(ctext, state_size=args.statesize)
+        # build the models and build a combined model:
+        # NB: attempting to implement Newline option here (and below):
+        if args.newline :
+            text_model = markovify.NewlineText(text, state_size=args.statesize)
+            ctext_model = markovify.NewlineText(ctext, state_size=args.statesize)
+        else:
+            text_model = markovify.Text(text, state_size=args.statesize)
+            ctext_model = markovify.Text(ctext, state_size=args.statesize)

        combo_model = markovify.combine([text_model, ctext_model], [1, args.weight])

@@ -88,13 +94,18 @@ def main():

        # if no combo file, just do normal:
    else:
-    # Get raw text as string.
+        # Get raw text as string.
        with open(args.infile, encoding="latin-1") as f:
            text = f.read()
        
        # Build the model:
        # NB: this errors if infile is EMPTY:
-        text_model = markovify.Text(text, state_size=args.statesize)
+
+        ## implement newline option here:
+        if args.newline :
+            text_model = markovify.NewlineText(text, state_size=args.statesize)
+        else:
+            text_model = markovify.Text(text, state_size=args.statesize)

        # Print -n number of randomly-generated sentences
        for i in range(args.sentences):
--- a/mkv_this/mkv_this_dir.py
+++ b/mkv_this/mkv_this_dir.py
@@ -20,6 +20,8 @@

 """
 a (very basic) script to collect all text files in a directory, markovify them and output a user-specified number of sentences to a text file.
+
+TODO: handle non-utf-8 encoded files.
 """
     
 import markovify
@@ -41,6 +43,7 @@ def main():
    parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5.", type=int, default=5)
    parser.add_argument('-l', '--length', help="set maximum number of characters per sentence.", type=int)
    parser.add_argument('-o', '--overlap', help="the amount of overlap allowed between original text and the output, expressed as a radio between 0 and 1. lower values make it more random. defaults to 0.5", type=float, default=0.5)
+    parser.add_argument('--newline', help="sentences in input file end with newlines rather than with full stops.", action='store_true')
    parser.add_argument('-c', '--combine', help="provide an another input text file with path to be combined with the first.")
    parser.add_argument('-w', '--weight', help="specify the weight to be given to the second text provided with --combine. defaults to 1, and the weight of the initial text is also 1. setting this to 1.5 will place 50 percent more weight on the second text. setting it to 0.5 will place less.", type=float, default=1)

@@ -53,21 +56,26 @@ def main():
            if filename.endswith(('.txt', '.org', '.md')):
                matches.append(os.path.join(root, filename))

-    # concatenate the files into batchfile.txt:
-    batchfile = os.path.dirname(args.indir) + os.path.sep + 'batchfile.txt' # SEEMS like it works?
+    # place batchfile.txt in user-given directory:
+    batchfile = os.path.dirname(args.indir) + os.path.sep + 'batchfile.txt'

+    # concatenate the files into batchfile.txt:
+    ###NB: trying to avoid encoding error here:
    with open(batchfile, 'w') as outfile:    
        for fname in matches:
-            with open(fname) as infile:
+            with open(fname, encoding="latin-1") as infile:
                outfile.write(infile.read())
+        outfile.close()

    # Get raw text from batchfile as string.
    with open(batchfile, 'r') as f:
        text = f.read()
    
    # Build the model:
-    # NB: this errors if infile is EMPTY:
-    text_model = markovify.Text(text, state_size=args.statesize)
+    if args.newline :
+        text_model = markovify.NewlineText(text, state_size=args.statesize)
+    else:
+        text_model = markovify.Text(text, state_size=args.statesize)

    # Print -n number of randomly-generated sentences
    for i in range(args.sentences):
@@ -85,5 +93,5 @@ def main():

    os.unlink(batchfile)

-    print('\n:  literary genius has been written to the file ' + args.outfile + '. thanks for playing!')
+    print("\n:  literary genius has been written to the file '" + args.outfile + "'. thanks for playing!")
    sys.exit()
--- a/setup.py
+++ b/setup.py
@@ -7,7 +7,7 @@ with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
    
 setup(name='mkv-this',
-      version='0.1.18',
+      version='0.1.23',
      description='markovify user-provided text and output the results to a text file.',
      long_description=long_description,
      long_description_content_type='text/markdown',
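
a note on the encoding change in `mkv_this_dir.py`: `latin-1` maps every possible byte to a character, so `open(fname, encoding="latin-1")` should not raise a decode error, but utf-8 input containing non-ascii characters comes out garbled. one possible alternative, which is not what this commit does, is to try utf-8 first and fall back to latin-1 only when that fails. `decode_permissive` and `read_text_permissive` are hypothetical helpers written for this sketch.

```python
def decode_permissive(raw):
    """Decode bytes as utf-8 if possible, otherwise fall back to latin-1.

    latin-1 accepts any byte sequence, so the fallback itself cannot fail;
    the result may still be wrong if the file used some other encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def read_text_permissive(path):
    """Read a file in binary mode and decode it with decode_permissive."""
    with open(path, "rb") as f:
        return decode_permissive(f.read())


if __name__ == "__main__":
    utf8_bytes = "café".encode("utf-8")
    latin1_bytes = "café".encode("latin-1")

    # Reading utf-8 bytes as latin-1 never errors, but garbles the accent.
    print(utf8_bytes.decode("latin-1"))

    # Trying utf-8 first keeps both inputs intact.
    print(decode_permissive(utf8_bytes))
    print(decode_permissive(latin1_bytes))
```

the trade-off is that a latin-1 file whose bytes happen to be valid utf-8 would be decoded as utf-8, so this is a heuristic rather than real encoding detection.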