Tag Archives: Recursion

Essays regarding theory and practice of recursive algorithms.

Using a Briefcase™? Briefly, a Use Case™.

Briefcase: a special kind of folder in Microsoft Windows, which synchronizes its
 contents with another briefcase of the same name when detected. Used
 to keep volatile documents on floppy disks & USB flash drives without
 constantly copying and pasting the contents of the whole disk every
 time it moves from workstation to workstation. Like an rsync daemon.

Use case: in the studies of software design & architecture, a storyboard sketch,
 or supposition about what a user expects or how s/he'll behave. E.g..
 This would be the initial node of a flowchart, a branch in main(), GUI
 dialog panes, or some interaction of user with program.
 Additionally, analysis of hostile users and newbies ("misuse case").

I am still moribund ("deadlocked," or "sick to death") by a headache that has
become cerebral palsy. I have been unable to concentrate on my plans this year.

Speaking of contributions to science, you can find my (literally) auriferous
portfolio at the magnanimous MediaFire (they're not just for pirates!):

    (Download & read the CARGO-MANIFEST.TXT to ascertain the contents of the archives you seek.)
    WARNING: ADULTS ONLY. (Explicit sexual content.)
    Videlicet is still kind of broken. DiffWalk, too, may be faulty.

The hyperlink will lead you to a MediaFire directory. I have added new archives
(for bandwidth conservationists). The file CARGO-MANIFEST.TXT describes all the
contents: _download and read it first_ if you want to know what's in them there
archives, which total over one hundred Megabytes, &/or retrieve your preference.

What's new: kanamo & transl8 (in MLPTK), Mutate-o-Matic, Videlicet, & DiffWalk.
(I said MLPTK was officially dead, but will I let it rest? How about no...)
Archivists curating art galleries downloaded from social networks will love
Videlicet, which solves the vexing twin problems of automatic attribution and
re-configurable data mining. (For those pesky copy protection mechanisms.
Videlicet.py easily cuts through Web galleries and markup up to 1/4" thick.)

I even threw in the exprimental upnnas: yea, truly this is an epic day.
(^- That line alludes to one of the _Juicy Cerebellum_'s author's asides.)

The remainder of this briefing describes the salient points of a Python script I
wrote to automatically collate issues of my portfolio. Long story short: "diff."

Because the large size of the archives I upload has become problematic, I have
established a ramshackle mechanism to prepare smaller files for anyone concerned
about bandwidth conservation. (MediaFire reports only two Gigabytes since last
year, which is no big deal, but I certanly wasn't helping. Also I couldn't think
of much else to do.) In case you cared, the usual issues with bandwidth are
constriction & latency: to reuse Senator Ted Stevens' "tubes" metaphor, how wide
the tube is and how long it is, and either of these can alter an observer's
perception of the pressure of fluid forced through the pipe. "When the tubes get
full, things can't get through" -- like dead bodies, or the new episode of Veep.

Metaphorically one half of this mechanism is a portable diff utility: DiffWalk.
The other half is a shell script that identifies changes to the directory tree.
Neither is aught remarkable but why don't I talk your ear off about them anyway?

Diff is a program similar to cmp, used to compare two files and describe their
discrepancies. In common use on Unixlike systems, it is employed to create patch
files that require less time to transmit via point-to-point telecommunication
than would be needed to transmit the whole file whenever it changed. Because it
is so useful an algorithm, and because I've never seen one for Windows (except
in the Berkeley Utilities), I made (but didn't test) a portable one in Python.

DiffWalk is a walking collater that creates patches similar to diff's.
Although the two are not interoperable, they operate in the same manner:
by determination of where the files differ and description of the differences.
Therewith, a "new" file can be reconstructed from an "old" file plus a patch --
hypothetically, with according decrease of network bandwidth load.

Although the script is a few hundreds of lines long, the scanner (the part that
goes through the file looking for the interesting bits: such as, in this case,
the positions where the new file differs from the old) is one tenth that size.
As you've observed in my other software, I do without proper parsers & grammar.
This renders my work brief, vulgar, and full of bugs, but sometimes legible.

def diff (old_lines, new_lines): #fmt: old_offset old_lines new_lines\nlines\n
 patch_file = [ patch_copacetic_leadin ];
 scan_line = ""; # Compute MD5 checksums for both files...
 old_md5sum = hashlib.md5();
 for line in old_lines: old_md5sum.update(line);
 old_md5sum = old_md5sum.hexdigest();
 scan_line = "%s\t" % (old_md5sum);
 new_md5sum = hashlib.md5();
 for line in new_lines: new_md5sum.update(line);
 new_md5sum = new_md5sum.hexdigest();
 if new_md5sum == old_md5sum: return None; # same file? then no patch req'd.
 scan_line += "%s\n" % (new_md5sum);
 patch_file.append(scan_line); # Second line: old_md5 new_md5
 oi = 0; ol = len(old_lines); ni = 0; nl = len(new_lines);
 tally = 0; scan_line;
 unique_new_lines = set(new_lines) - set(old_lines);
 while ni < nl: # 2 phases: scan "same" lines, then diff lines
 oi = 0; tally = 0;
 while oi < ol and old_lines[oi] != new_lines[ni]: oi += 1;
 scan_line = "%d\t" % (oi); #Index in "old" file to cat some of its lines
 while oi < ol and ni < nl and old_lines[oi] == new_lines[ni]:
 tally += 1; ni += 1; oi += 1;
 scan_line += "%d\t" % (tally); # Number of lines to cat from "old" file
 tally = 0; next_ni = ni;
 while ni < nl and new_lines[next_ni] in unique_new_lines:
 tally += 1; next_ni += 1;
 scan_line += "%d\n" % (tally); # Number of lines to cat from "new" file
 patch_file.extend(new_lines[ni : next_ni]);
 ni = next_ni;
 # end while (scan the files, outputting the patch protocol format)
 return patch_file;
# end function diff: returns diff-style patches as writelines() compatible lists

Concise and transpicuous:
 1. Tally runs of lines that already existed in the old file. (Scan phase.)
 2. Tally runs of lines that do not exist in the old file. (Diff phase.)
 3. Print a patch format that permits ordered reconstitution of the lines.
 4. Repeat until the entire new file can be reconstructed from patch + old.

Here, Python's set()s abstract away a tedious series of repetitive scans.
Without set or a like data type, I'd have to either hash the "old" file's lines
myself (and waste time writing another binary tree) or loop through it all again
and again for each line of the new file. (That would be due to the fact that, if
lines had been moved about instead of simply moved apart by interjection, then a
lockstep scanner would mistakenly skip some and the patch file would be larger.)

There is no capacity to patch binary files, but DW still detects when they have
changed, and will write a copy into the patch directory. I assume that changes
to binary files are due to transcoding, and therefore the patch'd be just as big
-- some kinds of binary files, such as SQL databases, don't behave this way and
can be patched in the same manner as I patch text files, but I don't use them.
(If you extend the algorithm to databases or executables, don't forget to review
 the pertinent file formats and open the files in binary mode. :)

The rest of the script is a wrapper handling directory traversal and file I/O.

As `info diff` artfully states, "computer users often find occasion to ask how
2 files differ." The utility of a script like DiffWalk is therefore not limited
to patching, but compression protocol is its primary employment on my system. (I
still use `diff` for quotidian difference queries because DW isn't in my $PATH.) 
Likewise, the automatic collation of updates, such as moved and deleted files,
is a pleasant amelioration to the task of finding what's changed in an archive
since the last published edition. DiffWalk now handles these tasks for me.

If you'd like a better solution to the "Briefcase Problem" (how to synchronize
files across multiple installations with minimal time and fuss), don't forget to
stop by the manual pages for "diff", "patch", and "rsync".

こんにちわ い-よ, わんこ いいんちよ…

Title translates loosely, if at all, as "I ain't say hi to the First Dog."

Discrete mathematics treats of sets (groups of items, of known size) and logic.
The operations on sets are union (OR), intersection (AND), difference (-), and
symmetric difference (XOR). I often call AND conjunction and XOR disjunction.

As for logic, the usual Boolean operations with which programmers are familiar:
 AND: Only the items ("1" bits) in both sets (bytes). 01001 & 01100 == 01000
 OR: All the items ("1" bits) in either set (byte). 01001 | 01100 == 01101
 XOR: Everything in one set or the other, _not_ both. 01001 ^ 01100 == 00101
 NOT: Logical inversion. (Not conversion.) Same as ¬. ! 01001 == 10110
(NOT is a unary operator. The binary operators are commutative. The exclusive OR
 (XOR) operator has this additional property: given, axiomatically, A ^ B = C,
 then C ^ B = A and C ^ A = B. I think that makes XOR evanescent, but I can't
 remember if that's transitivity. Anyhow: symmetric difference self-reverses.
 See also memfrob() @ string.h, & especially Python manual section 6.7 (sets).)

I may have misrepresented some of the details: consult a discrete math text too.
Salient trivia: don't confuse the bitwise intersection (&) & union (|) with the
 boolean operators && and ||. Although equivalent for the values
 1 and 0, they behave differently for all others. For example:
 2 & 5 == 0; although both (bool)2 and (bool)5 typecast to true,
 and therefore (bool)2 & (bool)5 == (bool)2 && (bool)5 == true,
 bitwise & is valid on integers and the compiler will produce the
 integer bitwise & operation which results in false.
Et trivia: xor can be used (cautiously) to logically invert comparisons (^ 1).
(See also: GNU Bison, Extended Backus-Naur Format, formal grammar (maths).)

Combinatorics (the science of counting) is discrete mathematics: in which the
delightful factorial (a unary operator, !, whose operand is to the left, which
distinguishes it from unary logical NOT) makes many an appearance. Combinatorial
explosion is a task for any mathematician who wishes to optimize (by trading
memory for repeat computation) such procedures as decision trees, which are NP.
(See also: algorithmic complexity, non-deterministic polynomial time).

Combinations of a set, because order makes no difference, are fewer than its
permutations. A combiner can be written with sliding windows, as in Python's
"itertools" library. (I've also implemented a heuristic version in Javascript.)
Combinations represent all the ways you'd choose some number of items from a set
such that no two combinations contain exactly the same items.
There are S! / (M! * (S - M)!) M-length combinations chosen from a S-length set.
(X! is "X factorial." Don't confuse it with !X, which is "not X," i.e. "¬X." )

Contrariwise: permutations, in which the same items chosen in a different order
constitute a different permutation than any other, represent all the ways you'd
choose items from a set such that the time when you chose each item matters.
Permutation is most simply and directly implemented by way of recursion.
(Combinations can also be written this way, by narrowing the slices, as I have.)
There are S! / (S - M)! permutations of length M chosen from a set of length S.

The number of "same items, different order" permutations can be arrived upon via
subtraction of the combination formula from the permutation formula:

S! S! M! * S! S! (M! - 1) * S!
-------- - ------------- == ------------- - ------------- == ------------- 
(S - M)! M! * (S - M)! M! * (S - M)! M! * (S - M)! M! * (S - M)!

Because the "N-ary Counter" appears frequently in computer programs that behave
in a combinatorial way, and because I implemented them too in Mutate-o-Matic:
the items in any set, when numbered in monotonic integral progression from 0,
can represent the numerals of numbers in base N (where N == S, to reuse the "set
length" nomial from above). Arithmetic with these figures rarely makes sense,
although hash tables and cryptographic checksums sometimes behave similarly.
There are S ** M (S to the power M; ** is exponentiation in Python) N-ary
"numbers" of length M that are possible by counting in base S.

Combinations and permutations are not defined when the length M of the mutation
exceeds the length of the set. N-ary counters are defined for any M, however.

Today I'll describe a mutator I wrote in C. It permutes, combines, and counts.
It is somewhat more complex than the recursive one-liner you'd sensibly write,
but not much different once you see past all the pointers.

Such a one-liner, before I get carried away, would look something like this...
 def anagrams (w): # Pseudocode (read: I haven't tested it) anagram generator
 if len(w) == 1: # <-- Recursion always requires a sentinel condition,
 yield list(w) # unless you are extremely clever indeed.
 for i in range(len(w)): yield [ w[i] ] + anagrams(w[0 : i] + w[i + 1 :])
... in Python, though my own rendition in Python is somewhat more acrobatic (it
foolishly employs generator expressions); as for combinations, see PyDoc 10.7.1
(itertools.combinations). Python is fast, because it's just a shell around C, &
Python's "itertools" module will do the discrete math acrobatics in record time.
Itertools also contains many of the discrete math axioms I rewrote above -- see
its page in the Python manual, section 10.7, for the details.

Not content to waste nanoseconds, however, I resolved to write a significantly
less intelligible implementation using the C programming language. I ought to
point out that the savings in algorithmic complexity are vanishingly negligible:
it actually probably uses more time than the Python implementation, owing to its
prodigal invocation of malloc() (among my other C vices, such as **indirection).

Anyway, here's the Mutate-o-Matic in Stone Tablet Format courtesy of my C-hisel:
 typedef enum { PERM, COMB, NARY } mmode_t;
 // ^- Respectively, "permutation," "combination," or "N-ary counter" mode.
 void mutate (
 char *set[], // set of items to permute. An array of C-strings.
 size_t slen, // length of the set.
 size_t plen, // length of each permutation.
 mmode_t mode, // kind of mutation I'm executing.
 char delim[], // delimiter, per permutation item.
 char *prefix[] = NULL, // characters chosen already.
 size_t pflen = 0 // length of the prefix.
 ) {
 size_t subset_length = (mode == NARY) ? slen : slen - 1;
 char **next_set = (char**) calloc(subset_length, sizeof(char*));
 char **next_prefix = (char**) calloc(pflen + 1, sizeof(char*));
 if (next_set == NULL || next_prefix == NULL) { exit(-1); }
 for (int i = 0; i < pflen; next_prefix[i] = prefix[i], i++) ;
 for (int i = slen - 1, j, k; i >= ((mode == COMB) ? (int)plen : 0); i--) {
 next_prefix[last_item] = set[i];
 if (plen > 0) { // Either descend the recursion ...
 for (k = 0, j = 0; j < i; next_set[k++] = set[j++]) ;
 switch (mode) {// ^- the above are elements common to every mode
 case NARY: next_set[k++] = set[j]; // XXX NB: fallthrough
 case PERM: for (j++; j < slen; next_set[k++] = set[j++]); break;
 } // (^- conditionally copy elements not common to all three modes)
 mutate(next_set, k, plen, mode, delim, next_prefix, pflen);
 } else { // ... or print the mutation (selected set items).
 for (j = 0; j < last_item; printf("%s%s", next_prefix[j++], delim));
 printf("%s\n", next_prefix[last_item]);
 } // *Sound of cash register bell chiming.*
 } // end for (Select items and mutate subsets)
 free(next_set); free(next_prefix); // Paying tribute to Malloc, plz wait

(To download the C file, plus the rest of my works, retrieve my portfolio by way
 of the links to MediaFire supplied in the top post at my WordPress blog.)

It seems so simple: only an idiot couldn't see how to compute mutations in this
manner, right? However, although I first encountered the problem when I was but
a wee lass, & although I finally discovered the formula for permutations and
N-ary counters after the better part of two !@#$ decades, I yet fail to grok
_why_ the "bounded sliding windows" illustrated by itertools successfully choose
only the combinations. (The best I can do is "because they're in sort order?")

Anyway, the procedure is woefully uncomplicated:
append on the prefix an unchosen item from the set, then either...
 [remaining choices > 0] recurse, having culled chosen item (unless in N-ary
 mode); and, if in Combination mode, having culled
 items subsequent to the sliding window (set[i]),
... or ...
 [choices == 0] print chosen items, which are a mutation, & continue

I'm sure you can see that this whole procedure would be a lot faster if I wasn't
spending precious clock ticks spelunking the dreaded caves of malloc(). A global
array of arrays with entries for each prefix length would really speed things up
-- as would splitting the mutater into three different functions, to eliminate
the branching instructions ("if-else," "switch," and "(predicate) ? a : b").

However, I recommend you don't bother trying to write it faster in C. What with
the huge cache memories CPUs have these days, it's probably plenty zippy. Even
though it _can_ be faster in C, it could be even faster in Assembler, so learn
an instruction set and assemble a core image if that's what you really want.

Until then, for all its inefficiency, Mutate-o-Matic renders about 100,000
mutations per second. That's faster than those sci-fi lockpicks in the movies.

Pan Fried Programming

(Here's the update -- nothing much is new:
MLPTK: http://www.mediafire.com/file/m3u25i445lqkztb/mlptk-2016-12-16.zip
Source Highlight: http://www.mediafire.com/file/ygxb14ie94cwcuy/mlptk-src-hilite-2016-12-16.zip
Book: http://www.mediafire.com/file/vg439qruq3do90q/mlptk-book-2016-12-16.zip

Remedial (adj.): intended to rectify, amend, heal.
Panacea (n., myth): goddess of healing, daughter of Aesculapius.
Pansear (n.): Chili's Pokémon.

This remedial lecture will tersely cover a semester's curriculum,
similar to what you have learnt in your high school algebra class,
comprising the fundamentals of programming with synthetic languages
(those that are above machine code).

If you don't know what computer programming is, I would recommend that you study
some tutorials & encyclopedia articles. Much is available on the WWW (Worldwide
Web). The Web is a part of the Internet, and it is the Web you access from your
Web browser when you navigate to a Web page. You could also try one'a them there
"<name of programming language> For Dummies" textbooks: the "For Dummies" books
are excellent "Cliff's Notes"-style crash courses, and each aims to be similar
to a "101" course in the topic advertised.

To make a beginning with any programming language, all you must know is that a
computer computes: your instructions, issued in the program you write, tell the
machine how to progress from its input or initial state to a resultant output or
final state. These instructions look different in different languages -- some
languages require more or fewer -- but every computer program is an algorithm,
or "recipe for computation."

Computers and computer programs can be characterized as finite state automata.
They're like assembly-line robots who know where to weld each sheet of metal.
Also like assembly-line robots, they are "blind" in the sense that they'll kill
you with the soldering iron should you step between it and the sheet.
Computing machines do what they're told, even when it is particularly stupid:
that's why computer viruses, worms, and computer espionage exist.

In simplest terms, the computer's processing units receive some numbers and an
instruction that says what mathematical operation to execute, then operates:
like a calculator. High-level programming languages are more synthetic, like a
human language is, and comprise such ideas as objects (amalgamations of data) &
functions (modular sub-routines). Compilers or interpreters read these languages
and translate them into machine instructions, simplifying the lengthy series of
instructions necessary to make the calculator execute these difficult tasks.

In a high-level language, there are few technical concerns.
You can begin immediately with the abstract concepts.
Here are some:

As in algebra, a variable is a name that represents a value.
As in solving a system of equations, values are typically assigned to some vars
and the value of the other variables is computed using the values given.
For example, in Javascript:
    var a = 2;
    var b = a + 2;
The variable <b> is now equal to 2 + 2. Similar operations function similarly.
In Javascript and other very-high-level languages, variables aren't only scalars
and can point at any object. They're like placeholders for procedure.
Although "variable" implies a value stored in memory, and "identifier" only its
mnemonic, the words "variable" & "identifier" used loosely mean about the same.
    "Just don't try that with the Captain."
        -- Geordi LaForge, to Data, _Star Trek: the Next Generation._

These are important ideas that are abstracted away in VHLLs. A pointer stores an
address in memory, for a later indirect read/write or similar operation. In HLLs
a pointer/reference accesses an object directly instead of copying its value.
You'll rarely have to make the distinction in Javascript; but, for example:
    var a = new Array(1, 2, 3); // a[0] == 1, a[1] == 2, a[2] == 3
    var b = a; // Incidentally, b === a, and that is why in the next line...
    b[0] = 555; // ... b[0] == 555, and now a[0] also == 555!
As opposed to:
    var c = new Array(3); // c is a new array of length 3
    c[0] = b[0]; c[1] = b[1]; c[2] = b[2]; // copy scalar values one-by-one
    c[0] = 0; // c[0] == 0, but b[0] remains == a[0], which remains == 555.
    var d = 2;
    var e = d;
    e = 4; // e == 4, but d is still == 2.
As you can see, operating on an object (such as via array subscript operation)
changes the object, even if the object is pointed by multiple variables.
Likewise, objects passed as the argument of a function are passed by reference:
they aren't simply copied, and operating on the argument within the function is
equivalent to changing the object, whose scope is above that of the function.
Some high-level languages, like C, permit you to explicitly specify what is a
pointer or reference, which eliminates some of this confusion but requires more
exacting attention to detail in your design specification.

The state of a program is the value of all its variables, the current location
within the instruction set, and the environment of the operating system (or the
interpreter). In Javascript, within Web browsers, the browser typically provides
access to some of its state via the Document Object Model.

Heuristics, or "guesswork," could not exist if there were no way to execute some
different code depending on the state of the program. Furthermore there are some
mathematics you can't write as exactly one set of instructions that produces one
semantic value: for instance, a function defined only on an interval, or an even
root of a positive number. In this circumstance, you are writing branches:
    if (5 > 10) { /* of course, the code in this block never happens. */ }
    else if (2 < 0) { /* neither does this, b/c cond'n is also false. */ }
    else { /* but all of this code happens, because the others didn't. */ }
... One of the branches executes, and the others don't.
The part in parentheses is the "conditional statement:" it's evaluated as either
"true" or "false," like in Boolean logic. 

Identifiers are only valid within the block (curly brackets, or { ... }) where
they were declared. Well, they're supposed to, anyway. Therefore, if you declare
a variable inside a function, you can't use it outside of the function or within
another function. Why would you want to, anyway? The next time you invoked the
function, the value of the variables you were using in there would change again.

Computers are great at repetition. Loops repeat a set of instructions: they are
typically written as a prefix, conditional, postfix, and body. For example:
    for (var T = 10; T > 0; T--) { alert("T minus " + T); }
... which counts down from ten to one with annoying alert popups.
While or do-while loops have only conditions & bodies.
A loop is an example of an "iterative algorithm." Each time the loop's body is
executed, it's called an "iteration." In computing fractal geometric patterns,
"iteration" means more like "recursion:" which, see below.

A function is a modular segment of your program: a sequence of computation that
is repeated a few times, or can be reused as part of another computation.
Functions are "invoked," or called by name, with values supplied as arguments,
and return a value, similarly to how functions behave in algebra. When declaring
a function, you'd typically write the name of the function followed by its args
in parentheses and then the function body. For example, again in Javascript:
    function intp (N) { return (N % 1) == 0; } // integer predicate
... which returns true if N is probably an integer, or whole number:
    if (intp(5)) { alert("Yes. 5 is probably an integer."); }
    if (intp(5.55)) { alert("This box never appears..."); }
    else { alert("... but this one does, because 5.55 is a floater."); }
(Floating-point numbers are inaccurate, in Javascript as likewise elsewhere.)

A function that invokes itself is a recursive function. Any function invoking an
other function, which subsequently causes the original function to be invoked
again, causes a recursion-like situation that I think is called "re-entrancy."
It is essential to note that _any_ and _every_ recursive function you can write
for a computer to execute can be rewritten as an iterative algorithm. The proof
of this is complex: it follows from Alan Turing's model of finite state automata
and the read-execute model of arithmetic and logic units (CPUs), and basically
asserts that you'd never be able to execute recursion if you couldn't do it by
reading one instruction at a time. In other words, each time your function calls
itself again, it's simply storing its state in memory temporarily while the
machine executes a fresh copy: after the copy is done, the former state is re-
loaded, and execution proceeds from there. This is achieved with stacks: data
structures that grow in size as more is "pushed" onto them, and shrink when some
is "popped" off of the top.

An object is a collection of data that comprises a several datum. That is, when
data are related to one another, they can be envisioned as a "shape" or "motion"
that is the sum of its parts. For example, a vector has direction and magnitude;
an individual has a first and last name; a parser has an input stream, a stack,
and a procedure. In Javascript, you'd write something like this:
    function Blah (newz) { if (newz) { this.z = newz; } return this; }
    Blah.prototype = new Object();
    Blah.prototype.z = 555;
    Blah.prototype.tell_me_z = function () { return this.z; }
    var a = new Blah(222), b = new Blah(); // a.z == 222; b.z = 555.
... syntax varies among languages. Essentially an object is a data structure
containing some members ("variables" attached to the object, like Blah::z above)
and, if the object is a class, some methods (functions, like ::tell_me_z).

Palling around.

Pall (n.): pawl.

I couldn't write last week, and my upgrade to QL has progressed no further.
For reference, I stalled before comparing the efficiency of nested Objects to
that of nested Arrays, which I must test before experimenting further with the
prototype compiler or even refining the design. I intend to do that this month.
In the meantime, here's a snapshot of MLPTK with new experiments included.


And a correction to my brief about the grammar ("Saddlebread"): actually, the
InchoateConjugation sequence does not cause differentiation, because the OP_CAT
prevents the original from reducing. Other parts may be inaccurate. I'll revise
the grammar brief and post a new one as soon as I have fixed the QL speed bug.

I took some time out from writing Quadrare Lexema to write some code I've been
meaning to write for a very long time: pal9000, the dissociated companion.
This software design is remarkably similar to the venerable "Eggdrop," whose C
source code is available for download at various locations on the Internets.
Obviously, my code is free and within the Public Domain (as open as open source
can be); you can find pal9000 bundled with today's edition of MLPTK, beneath the
/reference/ directory.

The chatbot is a hardy perennial computer program.
People sometimes say chatbots are artificial intelligence; although they aren't,
exactly, or at least this one isn't, because it doesn't know where it is or what
it's doing (actually it makes some assumptions about itself that are perfectly
wrong) and it doesn't apply the compiler-like technique of categorical learning
because I half-baked the project. Soon, though, I hope...

Nevertheless, mathematics allows us to simulate natural language.
Even a simplistic algorithm like Dissociated Press (see "Internet Jargon File,"
maintained somewhere on the World Wide Web, possibly at Thyrsus Enterprises by
Eric Steven Raymond) can produce humanoid phrases that are like real writing.
Where DisPress fails, naturally, is paragraphs and coherence: as you'll see when
you've researched, it loses track of what it was saying after a few words.

Of course, that can be alleviated with any number of clever tricks; such as:
	1. Use a compiler.
	2. Use a compiler.
	3. Use a compiler.
I haven't done that with p9k, yet, but you can if you want.

Of meaningful significance to chat robots is the Markov chain.
That is a mathematical model, used to describe some physical processes (such as
diffusion), describing a state machine in which the probability of any given
state occurring is dependent only on the next or previous state of the system,
without regard to how that state was encountered.
Natural language, especially that language which occurs during a dream state or
drugged rhapsody (frequently and too often with malicious intent, these are
misinterpreted as the ravings of madmen), can also be modeled with something
like a Markov chain because of the diffusive nature of tangential thought.

The Markov-chain chat robot applies the principle that the state of a finite
automaton can be described in terms of a set of states foregoing the present;
that is, the state of the machine is a sliding window, in which is recorded some
number of states that were encountered before the state existent at the moment.
Each such state is a word (or phrase / sentence / paragraph if you fancy a more
precise approach to artificial intelligence), and the words are strung together
one after another with respect to the few words that fit in the sliding window.
So, it's sort of like a compression algorithm in reverse, and similar to the way
we memorize concepts by relating them to other concepts. "It's a brain. Sorta."

One problem with Markov robots, and another reason why compilers are of import
in the scientific examination of artificial intelligence, is that of bananas.
The Banana Problem describes the fact that, when a Markov chain is traversed, it
"forgets" what state it occupied before the sliding window moved.
Therefore, for any window of width W < 6, the input B A N A N A first produces
state B, then states A and N sequentially forever.
Obviously, the Banana Problem can be solved by widening the window; however, if
you do that, the automaton's memory consumption increases proportionately.

Additionally, very long inputs tend to throw a Markov-'bot for a loop.
You can sorta fix this by increasing the width of the sliding window signifying
which state the automaton presently occupies, but then you run into problems
when the sliding window is too big and it can't think of any suitable phrase
because no known windows (phrases corresponding to the decision tree's depth)
fit the trailing portion of the input.
It's a sticky problem, which is why I mentioned compilers; they're of import to
artificial intelligence, which is news to absolutely no one, because compilers
(and grammar, generally) describe everything we know about the learning process
of everyone on Earth: namely, that intelligent beings construct semantic meaning
by observing their environments and deducing progressively more abstract ideas
via synthesis of observations with abstractions already deduced.
Nevertheless, you'd be hard-pressed to find even a simple random-walk chatbot
that isn't at least amusing.
(See the "dp" module in MLPTK, which implements the vanilla DisPress algorithm.)

My chatbot, pal9000, is inspired by the Dissociated Press & Eggdrop algorithms;
the copy rights of which are held by their authors, who aren't me.
Although p9k was crafted with regard only to the mathematics and not the code,
if my work is an infringement, I'd be happy to expunge it if you want.

Dissociated Press works like this:
	1. Print the first N words (letters? phonemes?) of a body of text.
	2. Then, search for a random occurrence of a word in the corpus
	   which follows the most recently printed N words, and print it.
	3. Ad potentially infinitum, where "last N words" are round-robin.
It is random: therefore, humorously disjointed.

And Eggdrop works like this (AFAICR):
	1. For a given coherence factor, N:
	2. Build a decision tree of depth N from a body of text.
	3. Then, for a given input text:
	4. Feed the input to the decision tree (mmm, roots), and then
	5. Print the least likely response to follow the last N words
	   by applying the Dissociated Press algorithm non-randomly.
	6. Terminate response after its length exceeds some threshold;
	   the exact computation of which I can't recall at the moment.
It is not random: therefore, eerily humanoid. (Cue theremin riff, thundercrash.)

A compiler, such as I imagined above, could probably employ sliding windows (of
width N) to isolate recurring phrases or sentences. Thereby it may automatically
learn how to construct meaningful language without human interaction.
Although I think you'll agree that the simplistic method is pretty effective on
its own; notwithstanding, I'll experiment with a learning design once I've done
QL's code generation method sufficiently that it can translate itself to Python.

Or possibly I'll nick one of the Python compiler compilers that already exists.
(Although that would take all the fun out of it.)

I'll parsimoniously describe how pal9000 blends the two:

First of all, it doesn't (not exactly), but it's close.
Pal9000 learns the exact words you input, then generates a response within some
extinction threshold, with a sliding window whose width is variable and bounded.
Its response is bounded by a maximum length (to solve the Banana Problem).
Because it must by some means know when a response ends "properly," it also
counts the newline character as a word.
These former are departures from Eggdrop.
It also learns from itself (to avoid saying something twice), as does Eggdrop.

In addition, p9k's response isn't necessarily random.
If you use the database I included, or choose the experimental "generator"
response method, p9k produces a response that is simply the most surprising word
it encountered subsequent to the preceding state chain.
This produces responses more often, and they are closer to something you said
before, but of course this is far less surprising and therefore less amusing.
The classical Eggdrop method takes a bit longer to generate any reply; but, when
it does, it drinks Dos Equis.
... Uh, I mean... when it does, the reply is more likely to be worth reading.
After I have experimented to my satisfaction, I'll switch the response method
back to the classic Eggdrop algorithm. Until then, if you'd prefer the Eggdrop
experience, you must delete the included database and regenerate it with the
default values and input a screenplay or something. I think Eggdrop's Web site
has the script for Alien, if you want to use that. Game over, man; game over!

In case you're curious, the algorithmic time complexity of PAL 9000 is somewhere
in the ballpark of O(((1 + MAX_COHERENCE - MIN_COHERENCE) * N) ^ X) per reply,
where N is every unique word ever learnt and X is the extinction threshold.
"It's _SLOW._" It asymptotically approaches O(1) in the best case.

For additional detail, consult /mlptk/reference/PAL9000/readme.txt.

Pal9000 is a prototypical design that implements some strange ideas about how,
exactly, a Markov-'bot should work. As such, some parts are nonfunctional (or,
indeed, malfunction actually) and vestigial. "Oops... I broke the algorithm."
While designing, I altered multiple good ideas that Eggdrop and DisPress did
right the first time, and actually made the whole thing worse on the whole. For
a more classical computer science dish, try downloading & compiling Eggdrop.

On Loggin’.

The post title, "On Loggin'," is a pun on the algorithmic time-complexity of an
ordered binary tree seek operation: O(n * log(n)).

This lecture provides my advice in regard to sorting data by using binary trees.
(From Bouvier's Law Dictionary, Revised 6th Ed (1856) [bouvier]:
 ADVICE, com. law. A letter [...] intended to give notice of [facts contained].)
My lectures are in the Public Domain, and are free of charge.
"By order of the Author." - Mark Twain, _Huckleberry Finn._

A tree is a directed acyclic graph. ("Tree" (1998-11-12) via FOLDOC 30jan2010.)
Other sort methods besides tree sort include quicksort, bubble sort, heap sort,
and insertion sort. ("Sort" (1997-02-12) via FOLDOC, 30th January 2010.)

This pattern can be seen almost everywhere that data are represented as discrete
notional objects: for example, the directory trees contained within the file
systema of your personal computer, the technology/skill trees that are likely to
present themselves in your favorite video game, and in any computer program that
maintains the logical order of a data set as new data are added. These aren't
necessarily treelike data structures; but, because the perceived "shape" of the
data is a tree, they're probably similar if coded in object-oriented language.
I imagine a file system as "what an object even is;" although a computer program
is volatile, whereas files are nonvolatile, the idea is much the same.
See also: object-oriented programming, procedural vs. functional languages, file
systems, {non,}volatile memory such as RAM and EEPROM, sort algorithms.

Of course, treelike structures aren't always the optimum schematic to represent
a hierarchical assortment of data; definitively, like all other object-oriented
designs, they're mere abstractions employed to delineate pieces of a contiguous
unidimensional memory space (the Turing model of a logic machine considers RAM
to be just a big empty pair of set brackets; data are populated & operated-upon
arbitrarily, but the container is merely a set) and to signify correlation.
An object is a collection of related data; which comprise a several datum.
See also: pointers, heap memory, set theory, discrete math, and Boolean logic.

¶ The following (til the next pilcrow) is nearly a verbatim excerpt from FOLDOC.
The data stored in a tree is processed by traversal. The algorithm begins at
root node, transforms or collects or computes against the data, and then repeats
itself for the root's child-nodes ("branches") in some specific order. Three
common traversal orders are pre-order, post-order, and in-order traversal.
For the binary tree illustrated below:
       / \
      I   S
     / \
    D   E
A pre-order traversal visits the nodes in the order T I D E S.
A post-order traversal visits them in the order D E I S T.
An in-order traversal visits them in the order D I E T S.
¶ "Traversal" (2001-10-01) via FOLDOC, ed. 30th January 2010.
See also: recursion, the call stack (subroutine invocation), digraph traversal.

To encode a tree walk via recursive algorithm is straightforward, and is how the
concept of data set traversal is introduced to novitiate computer programmers.
(Well, after they've learned how to loop over an array with "for.")


    function put (pNode, value) {
        if (pNode.isEmpty()) { pNode.datum = value; }
        else if (value < pNode.datum) { put(pNode.bLeft, value); }
        else if (value > pNode.datum) { put(pNode.bRight, value); }
    } // Ordered insertion.

    function tell (pNode, value) {
        if (value < pNode.datum) { return tell(pNode.bLeft, value); }
        else if (value > pNode.datum) { return tell(pNode.bRight, value); }
        else if (pNode.isEmpty()) { return NULL; }
        return pNode;
    } // Seek pointer to inserted.

_Any_ recursive algorithm can be reduced to an iterative algorithm, provided you
can use variable-length arrays to simulate the function of the call stack, which
stores the arguments to the subroutine you invoked and the address to restore to
the instruction pointer when the subroutine returns. That is, the call stack is
like bookmarks or tabbed indices to tell the routine where it was before a jump.
Replace the "instruction pointer and arguments" model of a stack frame with any
constant-width data set that remembers what you were doing before you jumped to
the next node in the digraph, and, naturally, you can remember the same stuff as
if you had utilized the convenient paradigm that is recursivity. (Stack frames
don't really all have to be the same size, but a definite frame width spares you
from doing yet another algorithm to parse frames. See: network protocol stacks.)

The "stuff you remember" is the state of the finite state automaton who tends to
the mechanism whereby the machine knows which instruction to execute next.
Recursion provides this automaton "for free," because it crops up so frequently;
but, for whatever reason (such as the 3000-call recursion limit in Firefox 3.6),
you might want to write a tree sort without using recursion at all.

Gentlemen, behold...
    corm = GenProtoTok(
        function () {
            this.tree = new this.Hop();
            return this;

        "locus", undefined,

        "seek", function (u_val) {
            while (this.locus[2] != undefined) {
                if (u_val > this.locus[2]) { this.locus = this.hop(1); }
                else if (u_val == this.locus[2]) { return false; }
                else { this.locus = this.hop(0); } // u < loc[2] || u == undef
            return true;
        }, // end function Hopper::seek

        "tellSorted", function () {
            if (this.tree == undefined) { return undefined; }
            var evaluation = new Array();
            for (var orca = new Array(this.tree), did = new Array(false);
            ) {
                this.locus = orca.pop();
                if (did.pop()) {
                    if (this.locus[1] != undefined) {
                } else {
                    if (this.locus[0] != undefined) {
            return evaluation;
        }, // end function Hopper::tellSorted

        "hop", function (index) {
            if (this.locus[index]==undefined) {
                this.locus[index] = new this.Hop();
            return this.locus[index];
        }, // end function Hopper::hop

        "addUnique", function (value) {
            this.locus = this.tree;
            if (this.seek(value)) { this.locus[2] = value; return true; }
            return false;
        }, // end function Hopper::addUnique

        "Hop", GenPrototype(Array, new Array())
... corm!

Here is how corm works: it's an Array retrofitted with a binary tree algorithm.

Each node of the tree is an Array whose members, at these indices, signify:
    0. Left-hand ("less-than") branch pointer.
    1. Right-hand ("greater-than") branch pointer.
    2. Value stored in the node.

Values are added via the corm object's addUnique method, which resets a pointer
to the algorithm's location in the tree and then seeks a place to put the datum.
Seeking causes a node to be created, if no node yet contains the datum; so, the
corm::seek method's name is misleading, but seeking is exactly what it does.

In-order traversal is executed by corm::tellSorted which returns the sorted set.
Recursion is simulated using a state stack whose frames each contain a pointer
to the "previous" node and a boolean value that signifies whether its left child
has been visited. Here is the main loop, step-by-step, in natural English:
    for (var orca = new Array(this.tree), did = new Array(false);
    ) {
"Given the stack, which I consider until it is exhausted..."
        this.locus = orca.pop();
"I am here." -^
        if (did.pop()) {
"If I did the left branch already,"
"then this value occurs at this location in the ordered set;"
            if (this.locus[1] != undefined) {
"if the right branch exists,"
"visit the right child on the next loop iteration"
"(and, naturally, I haven't visited the right child's left child yet)"
        } else {
"If I haven't done the left branch already,"
"then I have to visit this node again when I come back"
"(except I shall have visited the left child by then);"
            if (this.locus[0] != undefined) {
"if the left branch exists,"
"visit the left child on the next loop iteration"
"(and, naturally, I haven't visited the left child's left child yet)"
"... I consume pointers and bools, doing all that there stuff"

And that is how to do everything I did with the Hopper data structure
(which is here renamed corm, because I like corm) in quadrare-lexema.js.

Full disclosure: whereas; although I do put on women's clothing; and although
indeed I sleep all night and work all day; and, in addition, despite that I have
been known, of an occasion: to eat eat my lunch, go to the lavatory, pick wild
flowers, hang around in bars, and want to be a girlie; my Wednesdays have not
historically been known to involve shopping or buttered scones, I don't know how
to press a flower, I have neither suspenders nor heels nor a bra (but thanks for
offering), and am not, in fact, a lumberjack.