separate functions file, split writesentence in two, add multi page fetching fns also.

This commit is contained in:
mousebot 2020-04-25 19:56:24 -03:00
parent d06f84a5f1
commit 7063c8339b
5 changed files with 213 additions and 189 deletions


@ -1,8 +1,8 @@
## disclaimer
`mkv-this` simply makes some of the features of the excellent [markovify](https://github.com/jsvine/markovify) module available as a command line tool. i started on it because i wanted to to process my own offline files. i published it to share with friends. i'm a totally novice coder. so you are a programmer and felt like picking it up and improving on it, then by all means!
`mkv-this` makes some of the features of the excellent [markovify](https://github.com/jsvine/markovify) module available as a command line tool. i started on it because i wanted to process my own offline files the same way [fedibooks](https://fedibooks.com) can process mastodon toots. then i published it to share with friends. i'm a totally novice coder, so if you are a programmer and feel like picking it up and improving on it, then by all means!
the rest of these notes are for laypersons rather than programmers.
the rest of these notes are for end users rather than programmers.
## mkv-this
@ -10,7 +10,7 @@ the rest of these notes are for laypersons rather than programmers.
a second command, `mkv-this-dir` (see below) allows you to input a directory and it will read all text files within it as the input.
both commands allow you to add a second file or URL to the input, so you can combine your secret diary with some vile shit from the web.
both commands allow you to add a second file or URL to the input, so you can combine your secret diary with some vile shit from the webz.
### installing
@ -34,11 +34,11 @@ if you get sth like `ModuleNotFound error: No module named '$modulename'`, just
the script implements a number of the basic `markovify` options, so you can specify:
* how many sentences to output (default = 5).
* how many sentences to output (default = 5; NB: if your input does not use initial capitals, each 'sentence' will be more like a paragraph).
* the state size, i.e. the number of preceding words to be used in calculating the choice of the next word (default = 2).
* a maximum sentence length, in characters.
* the amount of (verbatim) overlap allowed between input and output.
* if your text's sentences end with newlines rather than full-stops.
* if your text's sentences end with newlines rather than full-stops. handy for inputting poetry.
* an additional file or URL to use for text input. you can add only one. if you want to feed a stack of files into your bank, use `mkv-this-dir` (see below).
* the relative weight to give to the second file if it is used.
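to get a feel for what the state size option actually does, here is a rough pure-python sketch of the underlying idea (this is an illustration only, not mkv-this or markovify code):

```python
import random
from collections import defaultdict

def build_model(text, state_size=2):
    """map each run of state_size words to the words that can follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        model[state].append(words[i + state_size])
    return model

text = ("the cat sat on the mat and the cat ran to the dog "
        "and the dog sat on the cat")
model = build_model(text, state_size=2)

# with a state size of 2, the next word is chosen by looking only at
# the last two words generated so far:
next_word = random.choice(model[("the", "cat")])
```

a larger state size means fewer continuations per state, so the output hews closer to the input text; state size 1 is the most random.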
@ -48,7 +48,7 @@ run `mkv-this -h` to see how to use these options.
if you want to input a stack of files, use `mkv-this-dir` instead. specify a directory and all text files in it will be used as input.
as with `mkv-this` you can also combine it with a URL.
as with `mkv-this` you can also combine your directory with a URL.
if for some reason you want to get similar functionality with `mkv-this`, you can easily concatenate the files yourself from the command line, then process the resulting file:
@ -62,11 +62,13 @@ if for some reason you want to get a similar funtionality with `mkv-this`, you c
you need to input plain text files. currently accepted file extensions are `.txt`, `.org` and `.md`. it is trivial to add others, so if you want one included just ask.
if you don't have text files, but odt files, use a tool like `odt2txt` or `unoconv` to convert them to text en masse. both are available in the repos.
### for best results
feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (code, HTML tags, links, metadata, stars, bullets, lines, tables, etc.), unless you want mashed versions of those things in your output. (no need to clean up URLs though.)
feed `mkv-this` large-ish amounts of well punctuated text. it works best if you bulk replace/remove as much mess as possible (code, timestamps, HTML, links, metadata, stars, bullets, lines, tables, etc.), unless you want mashed versions of those things in your output. (no need to clean up the webpages you input though!)
youll probably want to edit or select things from the output. it doesn't rly output print-ready boilerplate bosh, although many bots are happily publishing its output directly. you might find that it prompts you to edit it like a bot.
you'll probably want to edit or select things from the output. it doesn't really output print-ready boilerplate bosh, although many bots are happily publishing its output directly.
for a few further tips, see https://github.com/jsvine/markovify#basic-usage.
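the sort of bulk clean-up suggested above can be done with a few regular expressions before feeding a file in. a rough sketch (these patterns are only examples, not anything mkv-this does for you):

```python
import re

raw = """* bullet point one
# A Markdown Heading
| a | table | row |
some actual sentences, kept as they are.
<p>an HTML remnant</p>
"""

cleaned = raw
cleaned = re.sub(r'<[^>]+>', '', cleaned)           # strip HTML tags
cleaned = re.sub(r'(?m)^\s*[#*|].*$', '', cleaned)  # drop headings, bullets, table rows
cleaned = re.sub(r'\n{2,}', '\n', cleaned)          # collapse the blank lines left behind
print(cleaned)
```

anything you don't scrub out will turn up, mashed, in the output.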
@ -82,5 +84,7 @@ i know nothing about macs so if you ask me for help i'll just send you random co
### todo
* option to also append input model to a saved JSON file. (i.e. `text_model.to_json()`, `markovify.Text.from_json()`)
* hook it up to a web-scraper.
* hook it up to pdfs.
* option to also append input model to a saved JSON file. (i.e. `text_model.to_json()`, `markovify.Text.from_json()`). that way you could build up a bank over time.
* learn how to programme.

mkv_this/functions.py (new file, 144 lines)

@ -0,0 +1,144 @@
import os
import re
import sys
import argparse

import bs4  # used by get_urls
import html2text
import markovify
import requests
fnf = ': error: file not found. please provide a path to a really-existing file!'
def URL(insert):
    """ fetch a url """
    try:
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched URL.')
        return req.text
def convert_html(html):
    """ convert a fetched page to text """
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    print(': URL converted to text')
    s = h2t.handle(html)
    s = re.sub('[#*]', '', s)  # remove hashes and stars from the 'markdown'
    return s
def read(infile):
    """ read your (local) file for the markov model """
    try:
        with open(infile, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(infile, encoding="latin-1") as f:
            return f.read()
    except FileNotFoundError:
        print(fnf)
        sys.exit()
def mkbtext(texttype, args_ss, args_wf):
    """ build a markov model """
    return markovify.Text(texttype, state_size=args_ss,
                          well_formed=args_wf)

def mkbnewline(texttype, args_ss, args_wf):
    """ build a markov model, newline """
    return markovify.NewlineText(texttype, state_size=args_ss,
                                 well_formed=args_wf)
def writeshortsentence(tmodel, args_sen, args_out, args_over, args_len):
    """ actually make the damn litter-atchya, and short """
    with open(args_out, 'a') as output:  # append
        for i in range(args_sen):
            output.write(str(tmodel.make_short_sentence(
                tries=2000, max_overlap_ratio=args_over,
                max_chars=args_len)) + '\n\n')
            output.write('*\n\n')
def writesentence(tmodel, args_sen, args_out, args_over, args_len):
    """ actually make the damn litter-atchya """
    # args_len is unused here: make_sentence has no max_chars option,
    # but the signature is kept parallel to writeshortsentence.
    with open(args_out, 'a') as output:  # append
        for i in range(args_sen):
            output.write(str(tmodel.make_sentence(
                tries=2000, max_overlap_ratio=args_over)) + '\n\n')
            output.write('*\n\n')
### functions for mkv_this_scr.py

def get_urls(st_url):
    """ fetch a bunch of article URLs from The Guardian world news page
    for a given date. Format: 'https://theguardian.com/cat/YEAR/mth/xx' """
    try:
        req = requests.get(st_url)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched initial URL.')
        soup = bs4.BeautifulSoup(req.text, "lxml")
        # pull the elements containing article links:
        art_elem = soup.select('div[class="fc-item__header"] a[data-link-name="article"]')
        urls = [elem.attrs['href'] for elem in art_elem]
        print(': fetched list of URLs')
        return urls  # returns a LIST
def scr_URLs(urls):  # input a LIST
    """ actually fetch all the URLs obtained by get_urls """
    try:
        content = []
        for url in urls:
            req = requests.get(url)
            req.raise_for_status()
            content.append(req.text)  # SUPER slow.
            print(': fetched page ' + url)
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: There was trouble in your list of URLs')
        sys.exit()
    else:
        print(': fetched all pages.')
        return content
def scr_convert_html(content):  # takes a LIST of html pages
    """ convert all pages obtained by scr_URLs """
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    s = [h2t.handle(page) for page in content]    # convert each page
    t = [re.sub('[#*]', '', page) for page in s]  # remove hash/star from the 'markdown'
    u = ' '.join(t)  # convert list to string
    print(': Pages converted to text')
    return u
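get_urls depends on a CSS selector tied to the markup of the Guardian's listing pages. here is a minimal offline sketch of how that selector pulls article links (the HTML fragment is invented, and html.parser stands in for lxml so the example needs no extra parser):

```python
import bs4

html = '''
<div class="fc-item__header">
  <a data-link-name="article" href="https://example.com/article-1">one</a>
</div>
<div class="fc-item__header">
  <a data-link-name="meta" href="https://example.com/not-an-article">meta</a>
  <a data-link-name="article" href="https://example.com/article-2">two</a>
</div>
'''

soup = bs4.BeautifulSoup(html, "html.parser")
# same selector as get_urls: article links inside item headers
art_elem = soup.select('div[class="fc-item__header"] a[data-link-name="article"]')
urls = [elem.attrs['href'] for elem in art_elem]
```

if the site ever changes its class names, the selector silently returns an empty list, so checking the length of the result would be a sensible guard.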


@ -1,8 +1,9 @@
#! /usr/bin/env python3
"""
mkv-this: input text, output markovified text.
Copyright (C) 2020 mousebot@riseup.net.
mkv-this: input text and/or url, output markovified text.
Copyright (C) 2020 martianhiatus@riseup.net.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@ -17,23 +18,20 @@
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
"""
"""
a (very basic) script to markovify local and/or remote text files and
output a user-specified number of sentences to a local text file.
see --help for other options.
"""
import os
import re
import requests
import markovify
import html2text
import os
import sys
import argparse
import html2text
from functions import URL, convert_html, read, mkbtext, mkbnewline, writesentence, writeshortsentence
# argparse
def parse_the_args():
parser = argparse.ArgumentParser(prog="mkv-this", description="markovify one or two local or remote text files and output the results to a local text file.",
parser = argparse.ArgumentParser(prog="mkv-this", description="markovify local text files or URLs and output the results to a local text file.",
epilog="may you find many prophetic énoncés in your virtual bird guts! Here, this is not at all the becomings that are connected... so if you want to edit it like a bot yourself, it is trivial.")
# positional args:
@ -45,7 +43,7 @@ def parse_the_args():
# optional args:
    parser.add_argument('-s', '--state-size', help="the number of preceding words used to calculate the probability of the next word. defaults to 2, 1 makes it more random, 3 less so. > 4 will likely have little effect.", type=int, default=2)
parser.add_argument(
'-n', '--sentences', help="the number of 'sentences' to output. defaults to 5.", type=int, default=5)
'-n', '--sentences', help="the number of 'sentences' to output. defaults to 5. NB: if your text has no initial caps, a 'sentence' will be a paragraph.", type=int, default=5)
parser.add_argument(
'-l', '--length', help="set maximum number of characters per sentence.", type=int)
parser.add_argument(
@ -67,79 +65,9 @@ def parse_the_args():
return parser.parse_args()
# fetch/read/build/write fns:
def URL(insert):
    try:
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched URL.')
        return req.text
def convert_html(html):
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    print(': URL converted to text')
    s = h2t.handle(html)
    s = re.sub('[#*]', '', s)  # remove hashes and stars from the 'markdown'
    return s
def read(infile):
    try:
        with open(infile, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(infile, encoding="latin-1") as f:
            return f.read()
    except FileNotFoundError:
        print(fnf)
        sys.exit()
def mkbtext(texttype):
    return markovify.Text(texttype, state_size=args.state_size,
                          well_formed=args.well_formed)

def mkbnewline(texttype):
    return markovify.NewlineText(texttype, state_size=args.state_size,
                                 well_formed=args.well_formed)
def writesentence(tmodel):
    for i in range(args.sentences):
        output = open(args.outfile, 'a')  # append
        # short:
        if args.length:
            output.write(str(tmodel.make_short_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n')
        # normal:
        else:
            output.write(str(tmodel.make_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n')
        output.write(str('*\n\n'))
        output.close()
# make args + fnf avail to all:
# make args avail:
args = parse_the_args()
fnf = ': error: file not found. please provide a path to a really-existing \
file!'
def main():
@ -147,7 +75,6 @@ def main():
if args.combine or args.combine_URL:
if args.combine:
# get raw text as a string for both:
# try:
# infile is URL:
if args.URL:
html = URL(args.infile)
@ -160,7 +87,6 @@ def main():
# if -C, combine it w infile/URL:
elif args.combine_URL:
# try:
# infile is URL:
if args.URL:
html = URL(args.infile)
@ -175,17 +101,21 @@ def main():
# build the models + a combined model:
# with --newline:
if args.newline:
text_model = mkbnewline(text)
ctext_model = mkbnewline(ctext)
text_model = mkbnewline(text, args.state_size, args.well_formed)
ctext_model = mkbnewline(ctext, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
ctext_model = mkbtext(ctext)
text_model = mkbtext(text, args.state_size, args.well_formed)
ctext_model = mkbtext(ctext, args.state_size, args.well_formed)
combo_model = markovify.combine(
[text_model, ctext_model], [1, args.weight])
writesentence(combo_model)
# write it combo!
if args.length:
writeshortsentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
# if no -c/-C, do normal:
else:
@ -201,14 +131,18 @@ def main():
# Build the model:
# if --newline:
if args.newline:
text_model = mkbnewline(text)
text_model = mkbnewline(text, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
text_model = mkbtext(text, args.state_size, args.well_formed)
writesentence(text_model)
# write it!
if args.length:
writeshortsentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
print('\n: The options you used are as follows:\n')
print('\n: :\n')
for key, value in vars(args).items():
print(': ' + key.ljust(15, ' ') + ': ' + str(value).ljust(10))
if os.path.isfile(args.outfile):
@ -220,6 +154,5 @@ def main():
sys.exit()
# for testing:
if __name__ == '__main__':
main()


@ -1,7 +1,8 @@
#! /usr/bin/env python3
"""
mkv-this-dir: input a directory, output markovified text based on all its text files.
Copyright (C) 2020 mousebot@riseup.net.
mkv-this-dir: input a directory (+ optional url), output markovified text based on all its text files.
Copyright (C) 2020 martianhiatus@riseup.net.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@ -17,17 +18,15 @@
along with this program. If not, see <https://www.gnu.org/licenses/>.
"""
"""
a (very basic) script to collect all text files in a directory, markovify them and output a user-specified number of sentences to a text file.
"""
import os
import re
import requests
import markovify
import sys
import argparse
import html2text
import requests
from functions import URL, convert_html, read, mkbtext, mkbnewline, writesentence, writeshortsentence
# argparse
def parse_the_args():
@ -40,7 +39,7 @@ def parse_the_args():
# optional args:
    parser.add_argument('-s', '--state-size', help="the number of preceding words the probability of the next word depends on. defaults to 2, 1 makes it more random, 3 less so.", type=int, default=2)
parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5.", type=int, default=5)
parser.add_argument('-n', '--sentences', help="the number of 'sentences' to output. defaults to 5. NB: if your text has no initial caps, a 'sentence' will be a paragraph.", type=int, default=5)
parser.add_argument('-l', '--length', help="set maximum number of characters per sentence.", type=int)
    parser.add_argument('-o', '--overlap', help="the amount of overlap allowed between original text and the output, expressed as a ratio between 0 and 1. lower values make it more random. defaults to 0.5.", type=float, default=0.5)
parser.add_argument('-C', '--combine-URL', help="provide a URL to be combined with the input dir")
@ -54,75 +53,11 @@ def parse_the_args():
return parser.parse_args()
# fetch/read/build/write fns:
def URL(insert):
    try:
        req = requests.get(insert)
        req.raise_for_status()
    except Exception as exc:
        print(f': There was a problem: {exc}.\n: Please enter a valid URL')
        sys.exit()
    else:
        print(': fetched URL.')
        return req.text
def convert_html(html):
    h2t = html2text.HTML2Text()
    h2t.ignore_links = True
    h2t.images_to_alt = True
    h2t.ignore_emphasis = True
    h2t.ignore_tables = True
    h2t.unicode_snob = True
    h2t.decode_errors = 'ignore'
    h2t.escape_all = False  # remove all noise if needed
    print(': URL converted to text')
    s = h2t.handle(html)
    s = re.sub('[#*]', '', s)  # remove hashes and stars from the 'markdown'
    return s
def read(infile):
    try:
        with open(infile, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(infile, encoding="latin-1") as f:
            return f.read()
    except FileNotFoundError:
        print(fnf)
        sys.exit()
def mkbtext(texttype):
    return markovify.Text(texttype, state_size=args.state_size,
                          well_formed=args.well_formed)

def mkbnewline(texttype):
    return markovify.NewlineText(texttype, state_size=args.state_size,
                                 well_formed=args.well_formed)
def writesentence(tmodel):
    for i in range(args.sentences):
        output = open(args.outfile, 'a')  # append
        # short:
        if args.length:
            output.write(str(tmodel.make_short_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n')
        # normal:
        else:
            output.write(str(tmodel.make_sentence(
                tries=2000, max_overlap_ratio=args.overlap,
                max_chars=args.length)) + '\n\n\n')
        output.write(str('*\n\n'))
        output.close()
# make args avail to all:
# make args avail:
args = parse_the_args()
def main():
#create a list of files to concatenate:
matches = []
@ -160,33 +95,42 @@ def main():
# Build combo model:
# if --newline:
if args.newline:
text_model = mkbnewline(text)
ctext_model = mkbnewline(ctext)
text_model = mkbnewline(text, args.state_size, args.well_formed)
ctext_model = mkbnewline(ctext, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
ctext_model = mkbtext(ctext)
text_model = mkbtext(text, args.state_size, args.well_formed)
ctext_model = mkbtext(ctext, args.state_size, args.well_formed)
combo_model = markovify.combine(
[text_model, ctext_model], [1, args.weight])
writesentence(combo_model)
# no combining:
# write it combo!
if args.length:
writeshortsentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(combo_model, args.sentences, args.outfile, args.overlap, args.length)
# no combining:
else:
# Build model:
# if --newline:
if args.newline:
text_model = mkbnewline(text)
text_model = mkbnewline(text, args.state_size, args.well_formed)
# no --newline:
else:
text_model = mkbtext(text)
text_model = mkbtext(text, args.state_size, args.well_formed)
writesentence(text_model)
# write it!
if args.length:
writeshortsentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
else:
writesentence(text_model, args.sentences, args.outfile, args.overlap, args.length)
os.unlink(batchfile)
print('\n: The options you used are as follows:\n')
# print('\n: The options you used are as follows:\n')
print('\n: :\n')
for key, value in vars(args).items():
print(': ' + key.ljust(15, ' ') + ': ' + str(value).ljust(10))
if os.path.isfile(args.outfile):
@ -197,6 +141,5 @@ def main():
sys.exit()
# for testing:
if __name__ == '__main__':
main()


@ -7,7 +7,7 @@ with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
long_description = f.read()
setup(name='mkv-this',
version='0.1.35',
version='0.1.40',
description='cli wrapper for markovify: take a text file or URL, markovify, save the results.',
long_description=long_description,
long_description_content_type='text/markdown',