implement input URLs & convert them. implement combine URLS for -dir too.

2020-04-24 18:11:34 -03:00 · 2020-04-24 18:11:34 -03:00 · ed7802b8eb
parent bea232667d
commit ed7802b8eb
5 changed files with 103 additions and 45 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -135,4 +135,8 @@ dmypy.json
 .pytype/

 # Cython debug symbols
-cython_debug/
+cython_debug/
+
+# mouse additions:
+.*#
+.*~
--- a/README.md
+++ b/README.md
@@ -1,4 +1,3 @@
-
 ## disclaimer

 i wrote this cli rapper for the `markovify` python module because i wanted its features to be available as a cli tool.
@@ -11,7 +10,7 @@ maybe this functionality already exists somewhere, but i couldn't find it. if it

 ## mkv-this

-`mkv-this` is a little script that outputs a bunch of bot-like sentences based on a bank of text that you feed it. the results are saved to a text file. if you run it again with the same output file, the new results are appended after the old ones.
+`mkv-this` is a script that outputs a bunch of bot-like sentences based on a bank of text that you feed it, either from a local text file or a URL, and saves the results to a text file. if you run it again with the same output file, the new results are appended after the old ones.

 a second command, `mkv-this-dir` (see below) allows you to input a directory and it will read all text files within it as the input.

@@ -44,21 +43,18 @@ the script implements a number of the basic `markovify` options, so you can spec
 * a maximum sentence length, in characters.
 * the amount of (verbatim) overlap allowed between input and output.
 * if your text's sentences end with newlines rather than full-stops.
-* an additional file to use for text input. you can add only one. if you want to feed a stack of files into your bank, use `mkv-this-dir`.
+* an additional file or URL to use for text input. you can add only one. if you want to feed a stack of files into your bank, use `mkv-this-dir`.
 * the relative weight to give to the second file if it is used.

-as of 0.1.29 you can also specify:
-
-* a URL to a text file online. (you can input something that isn't a text file but the results will be mush or the programme will crash.)
-* an additional URL to use as text input.
-
 run `mkv-this -h` to see how to use these options.

 ### mkv-this-dir: markovify a directory of text files

-`mkv-this` can only take two files as input. if you want to input a stack of files, use `mkv-this-dir`. specify a directory and all text files in it will be used as input.
+if you want to input a stack of files, use `mkv-this-dir` instead. specify a directory and all text files in it will be used as input.

-if for some reason you want to get a similar funtionality with `mkv-this`, you can easily concatenate files yourself from the command line, then process them:
+as with `mkv-this` you can also combine this directory with a URL.
+
+if for some reason you want to get a similar functionality with `mkv-this`, you can easily concatenate the files yourself from the command line, then process the resulting file:

 * copy all your text files into a directory
 * cd into the directory
@@ -72,7 +68,7 @@ you need to input plain text files. currently accepted file extensions are `.txt

 ### for best results

-feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (URLs, code, HTML tags, metadata, stars, bullets, lines, etc.), unless you want mashed versions of those things in your output.
+feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (URLs, code, HTML tags, metadata, stars, bullets, lines, etc.), unless you want mashed versions of those things in your output. (no need to clean up URLs though.)

 you’ll probably want to edit or select things from the output. it is very much supposed to be a kind of raw material rather than print-ready boilerplate bosh, although many bots are happily publishing such output directly. you might find that it prompts you to edit it like a bot yourself.

@@ -91,5 +87,4 @@ i know nothing about macs so if you ask me for help i'll just send you random co
 ### todo

 * option to also append input model to a saved JSON file. (i.e. `text_model.to_json()`, `markovify.Text.from_json()`)
-* maybe some copy in some basic webscraping boilerplate code.
 * learn how to programme.
--- a/mkv_this/mkv_this.py
+++ b/mkv_this/mkv_this.py
@@ -28,7 +28,7 @@ import requests
 import markovify
 import sys
 import argparse
-import json
+import html2text

 # argparse
 def parse_the_args():
@@ -37,9 +37,9 @@ def parse_the_args():

    # positional args:
    parser.add_argument(
-        'infile', help="the text file to process, with path. NB: file cannot be empty.")
+        'infile', help="the text file to process. NB: file cannot be empty.")
    parser.add_argument('outfile', nargs='?', default="./mkv-output.txt",
-                        help="the file to save to, with path. if the file is used more than once, subsequent literature will be appended to the file after a star. defaults to ./mkv-output.txt.")
+                        help="the file to save to. if the file is used more than once, subsequent literature will be appended to it. defaults to ./mkv-output.txt.")

    # optional args:
    parser.add_argument('-s', '--state-size', help="the number of preceding words used to calculate the probability of the next word. defaults to 2, 1 makes it more random, 3 less so. > 4 will likely have little effect.", type=int, default=2)
@@ -52,12 +52,12 @@ def parse_the_args():
    parser.add_argument(
        '-c', '--combine', help="provide another text file to be combined with the first item.")
    parser.add_argument('-C', '--combine-URL',
-                        help="provide an additional URL to be combined with the first item")
+                        help="provide a URL to be combined with the first item")
    parser.add_argument('-w', '--weight', help="specify the weight to be given to the text provided with -c or -C. defaults to 1, and the weight of the initial text is 1. 1.5 will place more weight on the second text, 0.5 will place less.", type=float, default=1)

    # switches
    parser.add_argument(
-        '-u', '--URL', help="infile is a URL. NB: for this to work it should be the location of a text file.", action='store_true')
+        '-u', '--URL', help="infile is a URL instead.", action='store_true')
    parser.add_argument('-f', '--no-well-formed', help="don't enforce 'well_formed': allow the inclusion of sentences containing []{}()""'' in the markov model. might filth up your text, eg if it contains 'smart' quotes.", action='store_false')
    # store_false = default to True.
    parser.add_argument(
@@ -67,7 +67,7 @@ def parse_the_args():

    return parser.parse_args()

-# read/build/write fns:
+# fetch/read/build/write fns:


 def URL(insert):
@@ -75,13 +75,21 @@ def URL(insert):
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
-        print(f'There was a problem: {exc}')
+        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
-        print('text fetched from URL.')
+        print(': fetched URL.')
        return req.text


+def convert_html(html):
+    h2t = html2text.HTML2Text()
+    h2t.ignore_links = True
+    h2t.ignore_images = True
+    print(': URL converted to text')
+    return h2t.handle(html)
+
+
 def read(infile):
    try:
        with open(infile, encoding="utf-8") as f:
@@ -123,7 +131,7 @@ def writesentence(tmodel):

 # make args + fnf avail to all:
 args = parse_the_args()
-fnf = 'error: file not found. please provide a path to a really-existing \
+fnf = ': error: file not found. please provide a path to a really-existing \
    file!'


@@ -135,7 +143,8 @@ def main():
            #            try:
            # infile is URL:
            if args.URL:
-                text = URL(args.infile)
+                html = URL(args.infile)
+                text = convert_html(html)
            # or normal:
            else:
                text = read(args.infile)
@@ -147,12 +156,14 @@ def main():
            #            try:
            # infile is URL:
            if args.URL:
-                text = URL(args.infile)
+                html = URL(args.infile)
+                text = convert_html(html)
                # or normal:
            else:
                text = read(args.infile)
            # now combine_URL:
-            ctext = URL(args.combine_URL)
+            html = URL(args.combine_URL)
+            ctext = convert_html(html)

        # build the models + a combined model:
        # with --newline:
@@ -174,10 +185,10 @@ def main():
        # Get raw text as string.
        # either URL:
        if args.URL:
-            text = URL(args.infile)
+            html = URL(args.infile)
+            text = convert_html(html)
        # or local:
        else:
-            #            try:
            text = read(args.infile)

        # Build the model:
@@ -194,10 +205,10 @@ def main():
    for key, value in vars(args).items():
        print(': ' + key.ljust(15, ' ') + ':  ' + str(value).ljust(10))
    if os.path.isfile(args.outfile):
-        print('\n:  literary genius has been written to the file '
-              + args.outfile + '. thanks for playing!\n\n: Here, this is not at all the becomings that are connected... so if you want to edit it like a bot yourself, it is trivial. Yes, although your very smile suggests that this Armenian enclave is not at all the becomings that are connected...')
+        print("\n:  literary genius has been written to the file "
+              + args.outfile + ". thanks for playing!\n\n: 'Here, this is not at all the becomings that are connected... so if you want to edit it like a bot yourself, it is trivial. Yes, although your very smile suggests that this Armenian enclave is not at all the becomings that are connected...'")
    else:
-        print('mkv-this ran but did NOT create an output file as requested. this is a very regrettable and dangerous situation. contact the package maintainer asap. soz!')
+        print(': mkv-this ran but did NOT create an output file as requested. this is a very regrettable and dangerous situation. contact the package maintainer asap. soz!')

    sys.exit()

--- a/mkv_this/mkv_this_dir.py
+++ b/mkv_this/mkv_this_dir.py
@@ -25,7 +25,8 @@ import os
 import markovify
 import sys
 import argparse
-
+import html2text
+import requests

 # argparse
 def parse_the_args():
@@ -41,7 +42,7 @@ def parse_the_args():
    parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5.", type=int, default=5)
    parser.add_argument('-l', '--length', help="set maximum number of characters per sentence.", type=int)
    parser.add_argument('-o', '--overlap', help="the amount of overlap allowed between original text and the output, expressed as a ratio between 0 and 1. lower values make it more random. defaults to 0.5", type=float, default=0.5)
-    parser.add_argument('-c', '--combine', help="provide an another input text file with path to be combined with the input directory.")
+    parser.add_argument('-C', '--combine-URL', help="provide a URL to be combined with the input dir")
    parser.add_argument('-w', '--weight', help="specify the weight to be given to the second text provided with --combine. defaults to 1, and the weight of the initial text is also 1. setting this to 1.5 will place 50 percent more weight on the second text. setting it to 0.5 will place less.", type=float, default=1)

    # switches
@@ -52,7 +53,29 @@ def parse_the_args():

    return parser.parse_args()

-# read, build, write fns:
+# fetch, read, build, write fns:
+
+
+def URL(insert):
+    try:
+        req = requests.get(insert)
+        req.raise_for_status()
+    except Exception as exc:
+        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
+        sys.exit()
+    else:
+        print(': fetched URL.')
+        return req.text
+
+
+def convert_html(html):
+    h2t = html2text.HTML2Text()
+    h2t.ignore_links = True
+    h2t.ignore_images = True
+    print(': URL converted to text')
+    return h2t.handle(html)
+
+
 def read(infile):
    try:
        with open(infile, encoding="utf-8") as f:
@@ -100,8 +123,10 @@ def main():
            for filename in filenames:
                if filename.endswith(('.txt', '.org', '.md')):
                    matches.append(os.path.join(root, filename))
+        print(': text files fetched and combined')
    else:
-        print('error: please enter a valid directory')
+        print(': error: please enter a valid directory')
+        sys.exit()

    # place batchfile.txt in user-given directory:
    batchfile = os.path.dirname(args.indir) + os.path.sep + 'batchfile.txt'
@@ -119,16 +144,38 @@ def main():

    # Get raw text from batchfile as string.
    text = read(batchfile)
-    
-    # Build model:
-    # if --newline:
-    if args.newline:
-        text_model = mkbnewline(text)
-    # no --newline:
-    else:
-        text_model = mkbtext(text)

-    writesentence(text_model)
+    if args.combine_URL:
+        html = URL(args.combine_URL)
+        ctext = convert_html(html)
+    
+        # Build combo model:
+        # if --newline:
+        if args.newline:
+            text_model = mkbnewline(text)
+            ctext_model = mkbnewline(ctext)
+        # no --newline:
+        else:
+            text_model = mkbtext(text)
+            ctext_model = mkbtext(ctext)
+
+        combo_model = markovify.combine(
+            [text_model, ctext_model], [1, args.weight])
+
+        writesentence(combo_model)
+        
+    # no combining:
+    else:
+        # Build model:
+        # if --newline:
+        if args.newline:
+            text_model = mkbnewline(text)
+        # no --newline:
+        else:
+            text_model = mkbtext(text)
+
+        writesentence(text_model)
+        
    os.unlink(batchfile)

    print('\n: The options you used are as follows:\n')
@@ -138,7 +185,7 @@ def main():
        print('\n:  literary genius has been written to the file '
              + args.outfile + '. thanks for playing!\n\n: Here, this is not at all the becomings that are connected... so if you want to edit it like a bot yourself, it is trivial. Yes, although your very smile suggests that this Armenian enclave is not at all the becomings that are connected...')
    else:
-        print('mkv-this ran but did NOT create an output file as requested. this is a very regrettable and dangerous situation. contact the package maintainer asap. soz!')
+        print(': mkv-this ran but did NOT create an output file as requested. this is a very regrettable and dangerous situation. contact the package maintainer asap. soz!')

    sys.exit()
    
--- a/setup.py
+++ b/setup.py
@@ -7,7 +7,7 @@ with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    long_description = f.read()
    
 setup(name='mkv-this',
-      version='0.1.32',
+      version='0.1.33',
      description='cli wrapper for markovify: take a text file, markovify, output the results to a text file.',
      long_description=long_description,
      long_description_content_type='text/markdown',
@@ -25,6 +25,7 @@ setup(name='mkv-this',
      install_requires=[
          'markovify',
          'argparse',
+          'html2text',
      ],
      zip_safe=False,
 )
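
note on the diff above: the same "fetch the URL, convert the HTML, otherwise read the local file" branch appears several times in `main()`, and `URL()` and `convert_html()` are copied into both modules. a minimal sketch of how that branch could be folded into one helper follows. the name `get_text` and its parameters are hypothetical and not part of this commit. the fetch, convert and read steps are passed in as callables so the sketch runs without network access or `html2text`; in the real scripts they would presumably be `URL`, `convert_html` and `read`.

```python
from typing import Callable


def get_text(source: str,
             is_url: bool,
             fetch: Callable[[str], str],
             convert: Callable[[str], str],
             read_file: Callable[[str], str]) -> str:
    """Return plain text for `source`.

    Hypothetical helper: if `is_url` is true, fetch the page and convert
    the HTML to text; otherwise read `source` as a local file.
    """
    if is_url:
        return convert(fetch(source))
    return read_file(source)


# Stand-ins for URL(), convert_html() and read(), for illustration only.
def fake_fetch(url: str) -> str:
    return "<p>text from " + url + "</p>"


def fake_convert(html: str) -> str:
    # Crude tag removal; the commit itself uses html2text for this step.
    return html.replace("<p>", "").replace("</p>", "")


def fake_read(path: str) -> str:
    return "contents of " + path


if __name__ == "__main__":
    print(get_text("https://example.org/page", True,
                   fake_fetch, fake_convert, fake_read))
    print(get_text("notes.txt", False,
                   fake_fetch, fake_convert, fake_read))
```

with a helper along these lines, each branch in `main()` would reduce to a single call for the infile and, where `-C` is given, a second call for the combine URL.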