separate functions file, split writesentence in two, add multi page fetching fns also.

This commit is contained in:
mousebot 2020-04-25 19:56:24 -03:00
parent d06f84a5f1
commit 7063c8339b
5 changed files with 213 additions and 189 deletions


@ -1,8 +1,8 @@
## disclaimer
`mkv-this` simply makes some of the features of the excellent [markovify](https://github.com/jsvine/markovify) module available as a command line tool. i started on it because i wanted to to process my own offline files. i published it to share with friends. i'm a totally novice coder. so you are a programmer and felt like picking it up and improving on it, then by all means!
`mkv-this` makes some of the features of the excellent [markovify](https://github.com/jsvine/markovify) module available as a command line tool. i started on it because i wanted to process my own offline files the same way [fedibooks](https://fedibooks.com) can process mastodon toots. then i published it to share with friends. i'm a totally novice coder, so if you are a programmer and feel like picking it up and improving on it, then by all means!
the rest of these notes are for laypersons rather than programmers.
the rest of these notes are for end users rather than programmers.
## mkv-this
@ -10,7 +10,7 @@ the rest of these notes are for laypersons rather than programmers.
a second command, `mkv-this-dir` (see below) allows you to input a directory and it will read all text files within it as the input.
both commands allow you to add a second file or URL to the input, so you can combine your secret diary with some vile shit from the web.
both commands allow you to add a second file or URL to the input, so you can combine your secret diary with some vile shit from the webz.
### installing
@ -34,11 +34,11 @@ if you get sth like `ModuleNotFound error: No module named '$modulename'`, just
the script implements a number of the basic `markovify` options, so you can specify:
* how many sentences to output (default = 5).
* how many sentences to output (default = 5; NB: if your input does not use initial capitals, each 'sentence' will be more like a paragraph).
* the state size, i.e. the number of preceding words to be used in calculating the choice of the next word (default = 2).
* a maximum sentence length, in characters.
* the amount of (verbatim) overlap allowed between input and output.
* if your text's sentences end with newlines rather than full-stops.
* if your text's sentences end with newlines rather than full-stops. handy for inputting poetry.
* an additional file or URL to use for text input. you can add only one. if you want to feed a stack of files into your bank, use `mkv-this-dir` (see below).
* the relative weight to give to the second file if it is used.
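to get a feel for what the state size option actually does, here is a rough pure-python sketch of the underlying idea (this is an illustration only, not mkv-this or markovify code):

```python
import random
from collections import defaultdict

def build_model(text, state_size=2):
    """map each run of state_size words to the words that can follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        model[state].append(words[i + state_size])
    return model

text = ("the cat sat on the mat and the cat ran to the dog "
        "and the dog sat on the cat")
model = build_model(text, state_size=2)

# with a state size of 2, the next word is chosen by looking only at
# the last two words generated so far:
next_word = random.choice(model[("the", "cat")])
```

a larger state size means fewer continuations per state, so the output hews closer to the input text; state size 1 is the most random.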
@ -48,7 +48,7 @@ run `mkv-this -h` to see how to use these options.
if you want to input a stack of files, use `mkv-this-dir` instead. specify a directory and all text files in it will be used as input.
as with `mkv-this` you can also combine it with a URL.
as with `mkv-this` you can also combine your directory with a URL.
if for some reason you want to get similar functionality with `mkv-this`, you can easily concatenate the files yourself from the command line, then process the resulting file:
@ -62,11 +62,13 @@ if for some reason you want to get a similar funtionality with `mkv-this`, you c
you need to input plain text files. currently accepted file extensions are `.txt`, `.org` and `.md`. it is trivial to add others, so if you want one included just ask.
if you don't have text files, but odt files, use a tool like `odt2txt` or `unoconv` to convert them to text en masse. both are available in the repos.
### for best results
feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (code, HTML tags, links, metadata, stars, bullets, lines, tables, etc.), unless you want mashed versions of those things in your output. (no need to clean up URLs though.)
feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (code, timestamps, HTML, links, metadata, stars, bullets, lines, tables, etc.), unless you want mashed versions of those things in your output. (no need to clean up the webpages you input though!)
youll probably want to edit or select things from the output. it doesn't rly output print-ready boilerplate bosh, although many bots are happily publishing its output directly. you might find that it prompts you to edit it like a bot.
you'll probably want to edit or select things from the output. it doesn't really output print-ready boilerplate bosh, although many bots are happily publishing its output directly.
for a few further tips, see https://github.com/jsvine/markovify#basic-usage.
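the sort of bulk clean-up suggested above can be done with a few regular expressions before feeding a file in. a rough sketch (these patterns are only examples, not anything mkv-this does for you):

```python
import re

raw = """* bullet point one
# A Markdown Heading
| a | table | row |
some actual sentences, kept as they are.
<p>an HTML remnant</p>
"""

cleaned = raw
cleaned = re.sub(r'<[^>]+>', '', cleaned)           # strip HTML tags
cleaned = re.sub(r'(?m)^\s*[#*|].*$', '', cleaned)  # drop headings, bullets, table rows
cleaned = re.sub(r'\n{2,}', '\n', cleaned)          # collapse the blank lines left behind
print(cleaned)
```

anything you don't scrub out will turn up, mashed, in the output.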
@ -82,5 +84,7 @@ i know nothing about macs so if you ask me for help i'll just send you random co
### todo
* option to also append input model to a saved JSON file. (i.e. `text_model.to_json()`, `markovify.Text.from_json()`)
* hook it up to a web-scraper.
* hook it up to pdfs.
* option to also append input model to a saved JSON file. (i.e. `text_model.to_json()`, `markovify.Text.from_json()`). that way you could build up a bank over time.
* learn how to programme.

mkv_this/functions.py (new file, 144 lines)

@ -0,0 +1,144 @@
import os
import re
import sys
import argparse

import bs4  # used by get_urls
import html2text
import markovify
import requests
fnf = ': error: file not found. please provide a path to a really-existing file!'
def URL(insert):
    """ fetch a url """
    try:
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched URL.')
        return req.text
def convert_html(html):
    """ convert a fetched page to text """
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    print(': URL converted to text')
    s = h2t.handle(html)
    s = re.sub('[#*]', '', s)  # remove hashes and stars from the 'markdown'
    return s
def read(infile):
    """ read your (local) file for the markov model """
    try:
        with open(infile, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(infile, encoding="latin-1") as f:
            return f.read()
    except FileNotFoundError:
        print(fnf)
        sys.exit()
def mkbtext(texttype, args_ss, args_wf):
    """ build a markov model """
    return markovify.Text(texttype, state_size=args_ss,
                          well_formed=args_wf)

def mkbnewline(texttype, args_ss, args_wf):
    """ build a markov model, newline """
    return markovify.NewlineText(texttype, state_size=args_ss,
                                 well_formed=args_wf)
def writeshortsentence(tmodel, args_sen, args_out, args_over, args_len):
    """ actually make the damn litter-atchya, and short """
    with open(args_out, 'a') as output:  # append
        for i in range(args_sen):
            output.write(str(tmodel.make_short_sentence(
                tries=2000, max_overlap_ratio=args_over,
                max_chars=args_len)) + '\n\n')
            output.write('*\n\n')
def writesentence(tmodel, args_sen, args_out, args_over, args_len):
    """ actually make the damn litter-atchya """
    # args_len is unused here: make_sentence has no max_chars option,
    # but the signature is kept parallel to writeshortsentence.
    with open(args_out, 'a') as output:  # append
        for i in range(args_sen):
            output.write(str(tmodel.make_sentence(
                tries=2000, max_overlap_ratio=args_over)) + '\n\n')
            output.write('*\n\n')
### functions for mkv_this_scr.py

def get_urls(st_url):
    """ fetch a bunch of article URLs from The Guardian world news page
    for a given date. Format: 'https://theguardian.com/cat/YEAR/mth/xx' """
    try:
        req = requests.get(st_url)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched initial URL.')
        soup = bs4.BeautifulSoup(req.text, "lxml")
        # pull the elements containing article links:
        art_elem = soup.select('div[class="fc-item__header"] a[data-link-name="article"]')
        urls = [elem.attrs['href'] for elem in art_elem]
        print(': fetched list of URLs')
        return urls  # returns a LIST
def scr_URLs(urls):  # input a LIST
    """ actually fetch all the URLs obtained by get_urls """
    try:
        content = []
        for url in urls:
            req = requests.get(url)
            req.raise_for_status()
            content.append(req.text)  # SUPER slow.
            print(': fetched page ' + url)
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: There was trouble in your list of URLs')
        sys.exit()
    else:
        print(': fetched all pages.')
        return content
def scr_convert_html(content):  # takes a LIST of html pages
    """ convert all pages obtained by scr_URLs """
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    s = [h2t.handle(page) for page in content]    # convert each page
    t = [re.sub('[#*]', '', page) for page in s]  # remove hash/star from the 'markdown'
    u = ' '.join(t)  # convert list to string
    print(': Pages converted to text')
    return u
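get_urls depends on a CSS selector tied to the markup of the Guardian's listing pages. here is a minimal offline sketch of how that selector pulls article links (the HTML fragment is invented, and html.parser stands in for lxml so the example needs no extra parser):

```python
import bs4

html = '''
<div class="fc-item__header">
  <a data-link-name="article" href="https://example.com/article-1">one</a>
</div>
<div class="fc-item__header">
  <a data-link-name="meta" href="https://example.com/not-an-article">meta</a>
  <a data-link-name="article" href="https://example.com/article-2">two</a>
</div>
'''

soup = bs4.BeautifulSoup(html, "html.parser")
# same selector as get_urls: article links inside item headers
art_elem = soup.select('div[class="fc-item__header"] a[data-link-name="article"]')
urls = [elem.attrs['href'] for elem in art_elem]
```

if the site ever changes its class names, the selector silently returns an empty list, so checking the length of the result would be a sensible guard.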


@ -1,8 +1,9 @@
#! /usr/bin/env python3
"""
mkv-this: input text, output markovified text.
Copyright (C) 2020 mousebot@riseup.net.
mkv-this: input text and/or url, output markovified text.
Copyright (C) 2020 martianhiatus@riseup.net.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@ -17,23 +18,20 @@
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
"""
"""
a (very basic) script to markovify local and/or remote text files and
output a user-specified number of sentences to a local text file.
see --help for other options.
"""
import os
import re
import requests
import markovify
import html2text
import os
import sys
import argparse
import html2text
from functions import URL, convert_html, read, mkbtext, mkbnewline, writesentence, writeshortsentence
# argparse
def parse_the_args():
parser = argparse.ArgumentParser(prog="mkv-this", description="markovify one or two local or remote text files and output the results to a local text file.",
parser = argparse.ArgumentParser(prog="mkv-this", description="markovify local text files or URLs and output the results to a local text file.",
epilog="may you find many prophetic énoncés in your virtual bird guts! Here, this is not at all the becomings that are connected... so if you want to edit it like a bot yourself, it is trivial.")
# positional args:
@ -45,7 +43,7 @@ def parse_the_args():
# optional args:
    parser.add_argument('-s', '--state-size', help="the number of preceding words used to calculate the probability of the next word. defaults to 2, 1 makes it more random, 3 less so. > 4 will likely have little effect.", type=int, default=2)
parser.add_argument(
'-n', '--sentences', help="the number of 'sentences' to output. defaults to 5.", type=int, default=5)
'-n', '--sentences', help="the number of 'sentences' to output. defaults to 5. NB: if your text has no initial caps, a 'sentence' will be a paragraph.", type=int, default=5)
parser.add_argument(
'-l', '--length', help="set maximum number of characters per sentence.", type=int)
parser.add_argument(
@ -67,79 +65,9 @@ def parse_the_args():
return parser.parse_args()
# fetch/read/build/write fns:
def URL(insert):
    try:
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched URL.')
        return req.text
def convert_html(html):
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    print(': URL converted to text')
    s = h2t.handle(html)
    s = re.sub('[#*]', '', s)  # remove hashes and stars from the 'markdown'
    return s
def read(infile):
    try:
        with open(infile, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(infile, encoding="latin-1") as f:
            return f.read()
    except FileNotFoundError:
        print(fnf)
        sys.exit()
def mkbtext(texttype):
    return markovify.Text(texttype, state_size=args.state_size,
                          well_formed=args.well_formed)

def mkbnewline(texttype):
    return markovify.NewlineText(texttype, state_size=args.state_size,
                                 well_formed=args.well_formed)
def writesentence(tmodel):
    for i in range(args.sentences):
        output = open(args.outfile, 'a')  # append
        # short:
        if args.length:
            output.write(str(tmodel.make_short_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n')
        # normal:
        else:
            output.write(str(tmodel.make_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n')
        output.write(str('*\n\n'))
        output.close()
# make args + fnf avail to all:
# make args avail:
args = parse_the_args()
fnf = ': error: file not found. please provide a path to a really-existing \
file!'
def main():
@ -147,7 +75,6 @@ def main():
if args.combine or args.combine_URL:
if args.combine:
# get raw text as a string for both:
# try:
# infile is URL:
if args.URL:
html = URL(args.infile)
@ -160,7 +87,6 @@ def main():
# if -C, combine it w infile/URL:
elif args.combine_URL:
# try:
# infile is URL:
if args.URL:
html = URL(args.infile)
@ -175,17 +101,21 @@ def main():
# build the models + a combined model:
# with --newline:
if args.newline:
text_model = mkbnewline(text)
ctext_model = mkbnewline(ctext)
text_model = mkbnewline(text, args.state_size, args.well_formed)
ctext_model = mkbnewline(ctext, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
ctext_model = mkbtext(ctext)
text_model = mkbtext(text, args.state_size, args.well_formed)
ctext_model = mkbtext(ctext, args.state_size, args.well_formed)
combo_model = markovify.combine(
[text_model, ctext_model], [1, args.weight])
writesentence(combo_model)
# write it combo!
if args.length:
writeshortsentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
# if no -c/-C, do normal:
else:
@ -201,14 +131,18 @@ def main():
# Build the model:
# if --newline:
if args.newline:
text_model = mkbnewline(text)
text_model = mkbnewline(text, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
text_model = mkbtext(text, args.state_size, args.well_formed)
writesentence(text_model)
# write it!
if args.length:
writeshortsentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
print('\n: The options you used are as follows:\n')
print('\n: :\n')
for key, value in vars(args).items():
print(': ' + key.ljust(15, ' ') + ': ' + str(value).ljust(10))
if os.path.isfile(args.outfile):
@ -220,6 +154,5 @@ def main():
sys.exit()
# for testing:
if __name__ == '__main__':
main()


@ -1,7 +1,8 @@
#! /usr/bin/env python3
"""
mkv-this-dir: input a directory, output markovified text based on all its text files.
Copyright (C) 2020 mousebot@riseup.net.
mkv-this-dir: input a directory (+ optional url), output markovified text based on all its text files.
Copyright (C) 2020 martianhiatus@riseup.net.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@ -17,17 +18,15 @@
along with this program. If not, see <https://www.gnu.org/licenses/>.
"""
"""
a (very basic) script to collect all text files in a directory, markovify them and output a user-specified number of sentences to a text file.
"""
import os
import re
import requests
import markovify
import sys
import argparse
import html2text
import requests
from functions import URL, convert_html, read, mkbtext, mkbnewline, writesentence, writeshortsentence
# argparse
def parse_the_args():
@ -40,7 +39,7 @@ def parse_the_args():
# optional args:
    parser.add_argument('-s', '--state-size', help="the number of preceding words the probability of the next word depends on. defaults to 2, 1 makes it more random, 3 less so.", type=int, default=2)
parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5.", type=int, default=5)
parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5. NB: if your text has no initial caps, a 'sentence' will be a paragraph.", type=int, default=5)
parser.add_argument('-l', '--length', help="set maximum number of characters per sentence.", type=int)
    parser.add_argument('-o', '--overlap', help="the amount of overlap allowed between original text and the output, expressed as a ratio between 0 and 1. lower values make it more random. defaults to 0.5.", type=float, default=0.5)
parser.add_argument('-C', '--combine-URL', help="provide a URL to be combined with the input dir")
@ -54,75 +53,11 @@ def parse_the_args():
return parser.parse_args()
# fetch/read/build/write fns:
def URL(insert):
    try:
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched URL.')
        return req.text
def convert_html(html):
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    print(': URL converted to text')
    s = h2t.handle(html)
    s = re.sub('[#*]', '', s)  # remove hashes and stars from the 'markdown'
    return s
def read(infile):
    try:
        with open(infile, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(infile, encoding="latin-1") as f:
            return f.read()
    except FileNotFoundError:
        print(fnf)
        sys.exit()
def mkbtext(texttype):
    return markovify.Text(texttype, state_size=args.state_size,
                          well_formed=args.well_formed)

def mkbnewline(texttype):
    return markovify.NewlineText(texttype, state_size=args.state_size,
                                 well_formed=args.well_formed)
def writesentence(tmodel):
    for i in range(args.sentences):
        output = open(args.outfile, 'a')  # append
        # short:
        if args.length:
            output.write(str(tmodel.make_short_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n')
        # normal:
        else:
            output.write(str(tmodel.make_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n\n')
        output.write(str('*\n\n'))
        output.close()
# make args avail to all:
# make args avail:
args = parse_the_args()
def main():
#create a list of files to concatenate:
matches = []
@ -160,33 +95,42 @@ def main():
# Build combo model:
# if --newline:
if args.newline:
text_model = mkbnewline(text)
ctext_model = mkbnewline(ctext)
text_model = mkbnewline(text, args.state_size, args.well_formed)
ctext_model = mkbnewline(ctext, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
ctext_model = mkbtext(ctext)
text_model = mkbtext(text, args.state_size, args.well_formed)
ctext_model = mkbtext(ctext, args.state_size, args.well_formed)
combo_model = markovify.combine(
[text_model, ctext_model], [1, args.weight])
writesentence(combo_model)
# no combining:
# write it combo!
if args.length:
writeshortsentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
# no combining:
else:
# Build model:
# if --newline:
if args.newline:
text_model = mkbnewline(text)
text_model = mkbnewline(text, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
text_model = mkbtext(text, args.state_size, args.well_formed)
writesentence(text_model)
# write it!
if args.length:
writeshortsentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
os.unlink(batchfile)
print('\n: The options you used are as follows:\n')
# print('\n: The options you used are as follows:\n')
print('\n: :\n')
for key, value in vars(args).items():
print(': ' + key.ljust(15, ' ') + ': ' + str(value).ljust(10))
if os.path.isfile(args.outfile):
@ -197,6 +141,5 @@ def main():
sys.exit()
# for testing:
if __name__ == '__main__':
main()


@ -7,7 +7,7 @@ with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
long_description = f.read()
setup(name='mkv-this',
version='0.1.35',
version='0.1.40',
description='cli wrapper for markovify: take a text file or URL, markovify, save the results.',
long_description=long_description,
long_description_content_type='text/markdown',