The MediaGlyphs encoding v4.4 - Technical document
(specifications - rationale - parsing algorithm)

Contents

Instructions on how to encode and decode MediaGlyphs sentences are provided, together with specifications of the file locations used for displaying and linking images and explanation pages. An algorithm and sample Perl code are given at the bottom of the document.

Purpose of the encoding

Storage and transmission of codes representing glyphs, glyph-combinations and phonetic names.

Total Alphabet

[0-9] [a-z] [A-Z] [] {} @ ^ + = _

Alphabet explained

Glyph subset alphabet

[0-9] [a-z] [A-Z] {} @

Special symbols subset alphabet

[] ^ + =

Punctuation symbols that can appear in MG sentences

, . ; : ( ) ' " -
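
The parser at the bottom of this document looks punctuation up in a hash named %mg_punctuation, whose definition is not shown here. A minimal sketch of how such a hash could be built from the list above (the exact definition and the helper name is_mg_punctuation are assumptions):

```perl
use strict;
use warnings;

# Assumed definition: one key per punctuation symbol listed above.
my %mg_punctuation = map { $_ => 1 } (',', '.', ';', ':', '(', ')', "'", '"', '-');

# Hypothetical helper: returns 1 if the single character $c is MG punctuation, 0 otherwise.
sub is_mg_punctuation {
    my ($c) = @_;
    return exists $mg_punctuation{$c} ? 1 : 0;
}
```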

Additional symbols used

The " " (NO BREAK SPACE) can be used for human legibility but is squashed and ignored when parsing.

The "_" (LOW LINE) is used for compatibility with filesystems that do not differentiate between uppercase and lowercase letters. It is used in this way: all uppercase letters are followed by "_" to differentiate the filenames.
Hence "aa.png" is different from "A_A_.png" which is different from "aA_.png" and so on.
For transmission and storage of codes, the "_" is not needed, but for filenames (HTML pages, PNG files...) it is necessary.
Hence all MG encoded strings will be "escaped" by adding "_" and "unescaped" by removing it, as needed.
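
The escaping rule can be sketched in a few lines of Perl. The function names mg_escape and mg_unescape are hypothetical, not part of the MediaGlyphs code; the sketch only assumes the rule stated above (a "_" after every uppercase letter).

```perl
use strict;
use warnings;

# Hypothetical helper: remove every "_" (form used for storage and transmission).
sub mg_unescape {
    my ($code) = @_;
    $code =~ tr/_//d;
    return $code;
}

# Hypothetical helper: add "_" after every uppercase letter (form used for filenames).
# The input is unescaped first, so escaping an already escaped string is harmless.
sub mg_escape {
    my ($code) = @_;
    $code = mg_unescape($code);
    $code =~ s/([A-Z])/${1}_/g;
    return $code;
}

# Single quotes matter here: "@" would be interpolated inside double quotes.
print mg_escape('aA'), "\n";
print mg_unescape('A_A_'), "\n";
```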

When parsing MG codes, " " and "_" are eliminated.

Rationale

Two symbols from the "glyph subset alphabet" are required to specify a glyph.
E.g.: 7O qc w6 eN @f {j l{
all specify single glyphs.
NOTE: "@" will be used only as first symbol specifying a glyph, not appearing in second position.

There are hence 4160 (64*64 + 1*64) possible combinations of the "glyph subset alphabet" to encode a maximum of 4160 single glyphs. We don't expect to reach this maximum number, and instead we plan to keep the number of single glyphs around 2000.
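
The count can be checked by building the two symbol sets explicitly. This is only an illustration of the arithmetic; the variable names are arbitrary.

```perl
use strict;
use warnings;

# Glyph subset alphabet: digits, lowercase, uppercase, the two braces and the at-sign.
my @first  = ('0'..'9', 'a'..'z', 'A'..'Z', '{', '}', '@');
# The at-sign never appears in second position, so it is dropped here.
my @second = grep { $_ ne '@' } @first;

# Arrays in numeric context give their element counts.
my $total = @first * @second;
print "$total possible glyph codes\n";
```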

The first symbol (of the two that specify a glyph) indicates the category that the glyph belongs to.
Hence "ja" and "ji" are glyphs in the same category ("numerals").
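
A minimal sketch of how a program might check a two-symbol glyph code and read its category symbol. Both function names are hypothetical; the only rules assumed are the ones stated above ("@" allowed in first position only, first symbol gives the category).

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $code is a well-formed two-symbol glyph code.
# First symbol: digit, letter, brace or at-sign; second symbol: the same set without the at-sign.
sub is_glyph_code {
    my ($code) = @_;
    return $code =~ /\A[0-9a-zA-Z{}\@][0-9a-zA-Z{}]\z/ ? 1 : 0;
}

# Hypothetical helper: the category symbol is simply the first symbol of the code.
sub glyph_category {
    my ($code) = @_;
    return substr($code, 0, 1);
}
```

With these, glyph_category('ja') and glyph_category('ji') both return "j", which matches the "numerals" example.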

The symbols from the "special symbols subset alphabet" all have a meaning that affects parsing, because they are involved in specifying composites, glyphs shifted to another category, phrases...

The trivial case: an MG string containing only symbols from the "glyph subset alphabet" can be parsed easily by splitting it into consecutive substrings of length 2; these are the codes that specify the glyphs and point directly to the image files (.png).

E.g.: "@baH@kQC@bbt"
(equivalent to "@b aH @k QC @b bt" and to "@baH_@kQ_C_@bbt")
would encode 6 consecutive glyphs (5 unique) whose images are located in the "l/" directory, with filenames: "@b.png" "aH_.png" "@k.png" "Q_C_.png" "bt.png"
('l' stands for 'library', short for 'image library').
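
The trivial case can be sketched as follows. The function name glyph_image_files is hypothetical, and the sketch assumes the input contains no special symbols and no punctuation.

```perl
use strict;
use warnings;

# Hypothetical helper: map a special-symbol-free MG string to its image paths.
sub glyph_image_files {
    my ($mgtext) = @_;
    $mgtext =~ tr/ _//d;                     # drop spaces and "_" before parsing
    my @codes = $mgtext =~ /(.{1,2})/gs;     # consecutive substrings of length 2
    my @files;
    for my $code (@codes) {
        (my $name = $code) =~ s/([A-Z])/${1}_/g;  # escape uppercase letters for the filesystem
        push @files, "l/$name.png";               # "l/" is the image library directory
    }
    return @files;
}

# Single quotes keep Perl from interpolating the at-signs.
print join(' ', glyph_image_files('@baH@kQC@bbt')), "\n";
```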

Things become slightly more complicated with the special symbols.

Explanation of special symbols and their syntax

More examples: sample sentences

The HTML and the encoded string of the following sample sentences can be compared:

Test and compare

The parsing output of a new program can be compared with that of the existing PHP display system.

Parsing algorithm

Remove " " and "_" from the encoded string

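For the whole length of the encoded string, look at the current position and do (as implemented in the Perl function below):

- newline or punctuation symbol: take 1 symbol as punctuation
- "[": take everything up to the matching "]" as a subphrase (nested phrase)
- "+^": take everything up to and including the next "+^" as a multicomposite
- "++": take 5 symbols (glyph whose category gets reclarified)
- "^": take 5 symbols
- "+": take 4 symbols
- "==": take everything up to the next "==" as a phonetic name
- "=": take everything up to the next "=" as a glyph name
- anything else: take 2 symbols as a single glyph code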

Parsing algorithm as a commented Perl function

# NOTE: the enclosing "sub" line is missing from this listing; the name parse_mg is a placeholder.
# %mg_punctuation is defined elsewhere and has the MG punctuation symbols as keys.
# $mgtext is expected to have had " " and "_" removed already (first step of the algorithm).
sub parse_mg {
  my ($mgtext)=@_;
  my $i=0; # current position in the encoded string
  my ($mgword,$subphrase,$subphrase_begin,$subphrase_end,$subphrase_length);
  my @components; # array containing the substrings that will be sent to separate functions that will take care of displaying and linking them
  my @components_what; # array containing the type of what has been put in @components ("." stands for punctuation, "p" for phrase, "oo" for reclarified...). This is needed to know which separate function to invoke for each substring inside the @components array. The codes used in this array are the names of the subdirectories in the x/ directory tree.
  while ($i < length($mgtext)) { # for the whole length of the encoded string
    if (substr($mgtext,$i,1) eq "\n") { # newline
      push (@components,substr($mgtext,$i,1));
      push (@components_what,"."); # newline is handled like punctuation
      $i++;
    } elsif ($mg_punctuation{substr($mgtext,$i,1)}) { # punctuation hash lookup
      push (@components,substr($mgtext,$i,1));
      push (@components_what,"."); # commas, colons...
      $i++;
    } elsif (substr($mgtext,$i,1) eq "[") { # subphrase begins
      $subphrase=substr($mgtext,$i); # temporary: everything from the "[" onwards
      # the correct matching "]" needs to be found, separate function used (see below for code)
      ($subphrase_begin,$subphrase_end)=find_first_subphrase_limits($subphrase);
      if ($subphrase_begin == -1) {
        print STDERR "error finding subphrase limits\n";
        return (0);
      } else {
        $subphrase_begin+=$i; $subphrase_end+=$i; # convert to mgtext coordinates
        $subphrase=substr($mgtext,$subphrase_begin+1,$subphrase_end-$subphrase_begin-1);
        $subphrase_length=length($subphrase);
        push (@components,$subphrase);
        push (@components_what,"p"); # subphrase (nested phrase)
        $i+=($subphrase_length+2); # subphrase + [ + ] (the delimiters)
      }
    } elsif (substr($mgtext,$i,2) eq "+^") { # oi-prefixed: multicomposite
      $mgword=substr($mgtext,$i+2);
      if (index($mgword,"+^") == -1) { # if no terminating "+^"
        print STDERR "multicomposite with beginning '+^' but no ending '+^' in '$mgtext'\n";
        return (0);
      } # otherwise let's isolate the +^multicomposite+^
      $mgword="+^" . substr($mgword,0,index($mgword,"+^")) . "+^";
      $i+=length($mgword); # the +^ are already counted in
      push (@components,$mgword);
      push (@components_what,"oi");
    } elsif (substr($mgtext,$i,2) eq "++") { # oo-prefixed trisyllabic (glyph whose category gets reclarified)
      $mgword=substr($mgtext,$i,5);
      $i+=5;
      push (@components,$mgword);
      push (@components_what,"oo");
    } elsif (substr($mgtext,$i,1) eq "^") { # i-prefixed tetrasyllabic
      $mgword=substr($mgtext,$i,5);
      $i+=5;
      push (@components,$mgword);
      push (@components_what,"i");
    } elsif (substr($mgtext,$i,1) eq "+") { # o-prefixed trisyllabic
      $mgword=substr($mgtext,$i,4);
      $i+=4;
      push (@components,$mgword);
      push (@components_what,"o");
    } elsif (substr($mgtext,$i,2) eq "==") { # uu-prefixed: phonetic names
      $mgword=substr($mgtext,$i+2);
      if (index($mgword,"==") == -1) { # if no terminating "=="
        print STDERR "phonetic name with beginning '==' but no ending '==' in '$mgtext'\n";
        return (0);
      } # otherwise let's isolate the ==phonname==
      $mgword=substr($mgword,0,index($mgword,"=="));
      $i+=length($mgword)+4; # the +4 is for the two == tags
      push (@components,$mgword);
      push (@components_what,"uu");
    } elsif (substr($mgtext,$i,1) eq "=") { # u-prefixed: glyph names
      $mgword=substr($mgtext,$i+1);
      if (index($mgword,"=") == -1) { # if no terminating "="
        print STDERR "glyph name with beginning '=' but no ending '=' in '$mgtext'\n";
        return (0);
      } # otherwise let's isolate the =glyphname=
      $mgword=substr($mgword,0,index($mgword,"="));
      $i+=length($mgword)+2; # the +2 is for the two = tags
      push (@components,$mgword);
      push (@components_what,"u");
    } else { # normal words: single glyph codes
      $mgword=substr($mgtext,$i,2);
      push (@components,$mgword);
      push (@components_what,"l"); # coreword, glyph
      $i+=2;
    }
  }
  return (\@components,\@components_what);
}

sub find_first_subphrase_limits {
  my ($string)=@_;
  my ($subphrase_begin,$subphrase_end);
  my ($tmp_begin,$tmp_end,$stringtmp);
  my ($begin_count,$end_count);
  $begin_count=($string =~ tr/[//);
  $end_count=($string =~ tr/]//);
  unless ($begin_count) {
    print STDERR "not even one [ in string \'$string\'\n";
    return (-1);
  }
  unless ($end_count) {
    print STDERR "not even one ] in string \'$string\'\n";
    return (-1);
  }
  # otherwise, balance check:
  if ($begin_count != $end_count) {
    print STDERR "unbalanced [] count in string \'$string\'\n";
    return (-1);
  }
  $tmp_begin = index($string,"[");
  $tmp_end = index($string,"]");
  
  if ($tmp_end < $tmp_begin) { # inverted ][ check
    print STDERR "wrong order: ] precedes [ in string \'$string\'\n";
    return(-1);
  }
  if ($begin_count == 1) { # then also end_count is
    return ($tmp_begin,$tmp_end);
  }
  $subphrase_begin=$tmp_begin;
  $subphrase_end=$tmp_end;
  $stringtmp=substr($string,$subphrase_begin,$subphrase_end-$subphrase_begin+1);
  $begin_count=($stringtmp =~ tr/[//);
  $end_count=($stringtmp =~ tr/]//);
  while ($begin_count != $end_count) {
    $tmp_end=index(substr($string,$subphrase_end+1),"]"); # next "]"
    $subphrase_end += $tmp_end+1;
    $stringtmp=substr($string,$subphrase_begin,$subphrase_end-$subphrase_begin+1);
    $begin_count=($stringtmp =~ tr/[//);
    $end_count=($stringtmp =~ tr/]//);
  }
  return($subphrase_begin,$subphrase_end);
}
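
For comparison, the same matching can be done with a depth counter instead of repeated bracket counting. This is a sketch of an alternative technique, not the function used by MediaGlyphs: the name find_matching_bracket is hypothetical, and unlike the function above it does not reject strings whose overall "[" and "]" counts differ, nor a "]" that precedes the first "[".

```perl
use strict;
use warnings;

# Hypothetical alternative: returns the positions of the first "[" and of its
# matching "]", or the one-element list (-1) if there is no such pair.
sub find_matching_bracket {
    my ($string) = @_;
    my $begin = index($string, '[');
    return (-1) if $begin == -1;          # no opening bracket at all
    my $depth = 0;
    for my $pos ($begin .. length($string) - 1) {
        my $c = substr($string, $pos, 1);
        $depth++ if $c eq '[';            # one level deeper
        $depth-- if $c eq ']';            # one level back up
        return ($begin, $pos) if $depth == 0;
    }
    return (-1);                          # string ended before the bracket was closed
}

my ($begin, $end) = find_matching_bracket('[ab[cd]ef]gh');
print "$begin $end\n";
```

The depth counter scans the string once, whereas the function above recounts the brackets of a growing substring on every iteration.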


MediaGlyphs.org
Last modified: Tue Sep 30 22:15:54 China Standard Time 2008
First appearance: Wed Jan 22 19:32:22 GMT 2003