Blog Entry
Today's experiement with Markov chains has been really entertaining. I have added a few heuristics to the basic two word markov chain described in The Practice of Programming to handle the vageries of blog entries a bit better. Also, I made walking the chain more random, so that the generated entries are different every time. The markov program presented in the K&P book always begins with the first word found in the data set. This makes for boring blog entries.
The first program presented snarfs blog entries of particular userID from use.perl using the SOAP interface. It creates simple data files that can be easily consumed by the markov chain program.
get_blog
The really interesting work is done by the next program, which consumes
the data and generates a output suitable for posting back to use.perl via
the perl scripts I described in my article. Notice that I keep hyperlinks together. Also
blockquotes. Some may call this cheating, but I think the effect is more
pleasant. I experimented with three work chains, but this wasn't random
enough for me. Perhaps I need a larger data set. A future version of this
program could use both chains and somehow blend them together for the output.
It would be great to see popular three word combinations appear in the
generated text, but perhaps I already get that with two word chains.
Also note that this script marks my first use of the qr
operator. I guess compiled regexes aren't so scary after all.
markov.pl
!i, qr!
!i ],
blockquote => [ qr!!i, qr!
!i ],
);
WORD:
for my $word (split /\s+/, $line) {
# am I in a special block?
# can't start a new special until I find the end of previous one
for my $el (qw(href ecode ul blockquote)) {
if ($in->{$el}) {
if ($word =~ /$delims{$el}[0]/) { # end?
$word = "$in->{$el} $word";
$in->{$el} = "";
} else {
$in->{$el} .= " $word";
next WORD;
}
} elsif ($word =~ /$delims{$el}[0]/) { # start?
$in->{$el} = $word;
next WORD;
}
}
if (CHAIN_SIZE > 2) {
push @{$tbl->{$w1}->{$w2}->{$w3}}, $word;
($w1, $w2, $w3) = ($w2, $w3, $word); # pull the chain along
} else {
push @{$tbl->{$w1}->{$w2}}, $word;
($w1, $w2) = ($w2, $word); # pull the chain along
}
}
# assign these keys back into the passed in ref
return @{$keys} = ($w1, $w2, $w3);
}
sub end_table {
my ($tbl) = shift(@_);
my ($w1, $w2, $w3) = @{shift(@_)};
if (CHAIN_SIZE > 2) {
push @{$tbl->{$w1}->{$w2}->{$w3}}, NONWORD;
} else {
push @{$tbl->{$w1}->{$w2}}, NONWORD;
}
}
sub make_chain {
my ($tbl, $size) = @_;
my $text = "";
# let's start in a rand point on the chain
my @w1 = ((keys %{$tbl}), NONWORD);
my $w1 = $w1[rand(@w1)];
my @w2 = ((keys %{$tbl->{$w1}}), NONWORD);
my $w2 = $w2[rand(@w2)];
my @w3 = (NONWORD);
if (CHAIN_SIZE > 2) {
@w3 = ((keys %{$tbl->{$w1}->{$w2}}), NONWORD);
}
my $w3 = $w3[rand(@w3)];
for my $i (0..$size) {
my $suf;
if (CHAIN_SIZE > 2) {
$suf = $tbl->{$w1}->{$w2}->{$w3};
} else {
$suf = $tbl->{$w1}->{$w2};
}
warn("word1: '$w1'\n\tword2: '$w2'\n") if DEBUG;
unless (ref $suf) {
$w1 = $w1[rand(@w1)];
$w2 = $w2[rand(@w2)];
$w3 = $w3[rand(@w3)];
redo;
}
my $r = int(rand @{$suf});
my $t = $suf->[$r];
if ($t eq NONWORD) {
warn ("detected the end of the chain ("
. (keys %{$tbl})
. ") at $i. reseting keys\n") if DEBUG;
$w1 = $w1[rand(@w1)];
$w2 = $w2[rand(@w2)];
$w3 = $w3[rand(@w3)];
next;
}
# there are "unbalanced" braces (close nuff for me)
if ($t !~ m!\([^\)]*\)!) {
$t =~ s!^\(!!;
$t =~ s!\)$!!;
$text .= "$t ";
}
if (CHAIN_SIZE > 2) {
($w1, $w2, $w3) = ($w2, $w3, $t);
} else {
($w1, $w2) = ($w2, $t);
}
}
# do some goofy clean up
$text = ucfirst $text;
chop $text; # final space
# remove stray punctuation
$text =~ s/[,:]$//;
if (substr($text, -1, 1) ne '.') {
$text .= ".";
}
$Text::Wrap::columns = 60;
return wrap("", "", $text);
}
Let's see how annoying this gets. I have a feeling after a week, I'll make this quietly go away. Or expand it into a something truly monstrous. I made several tweaks to the code just trying to post this blog entry.
UPDATE: Thanks to the wonders of Soviet-style revisionism, I have
updated this code a bit to remove the uses variables $w1,$w2,$w3
from the main line. Also, I randomize the input file order to make the output
less likely to come from the some person's blog. Perhaps I'll bundle this up
for CPAN or taskboy.com or something.

![[advertisement]](/blog/img/taskboy_ad_feedbag.gif)