T. V. Raman
The advent of electronic documents makes information available in more than its visual form; electronic information can now be display-independent. In this article, the author describes a computing system, AsTeR, that audio formats electronic documents to produce audio documents. AsTeR can speak both literary texts and highly technical documents (presently in (La)TeX) that contain complex mathematics. Visual communication is characterized by the eye's ability to actively access parts of a two-dimensional display. The reader is active, while the display is passive. This active-passive role is reversed by the temporal nature of oral communication: information flows actively past a passive listener. This prohibits multiple views -- it is impossible to first obtain a high-level view and then "look" at details. These shortcomings become severe when presenting complex mathematics orally.
Audio formatting, which renders information structure in a manner attuned to an auditory display, overcomes these problems. AsTeR is interactive, and the ability to browse information structure and obtain multiple views enables active listening.
This article describes a system for producing audio renderings. Print is not the ideal medium for describing such renderings (and ASCII is an even poorer one!). RFB members can acquire an audio formatted version of the author's thesis (this article is a slightly edited version of the first chapter), rendered by AsTeR, from Recording for the Blind (RFB order number FB190). Non-RFB customers may request a two track (standard commercial format) tape of AsTeR examples. Requests should be addressed to info@RFB.org; ask for Raman's Math Examples Tape.
Finally, readers with access to the WWW can experience an interactive demo of AsTeR at
http://www.cs.cornell.edu/Info/People/raman/aster/aster-toplevel.html
or
http://www.research.digital.com/CRL/personal/raman/aster/aster-toplevel.html
Documents encapsulate structured information. Visual formatting renders this structure on a two-dimensional display (paper or a video screen) using accepted conventions. The visual layout helps the reader recreate, internalize and browse the underlying structure. The ability to selectively access portions of the display, combined with the layout, enables multiple views. For example, a reader can first skim a document to obtain a high-level view and then read portions of it in detail.
The rendering is attuned to the visual mode of communication, which is characterized by the spatial nature of the display and the eye's ability to actively access parts of this display. The reader is active, while the rendering itself is passive.
This active-passive role is reversed in oral communication: information flows actively past a passive listener. This is particularly evident in traditional forms of reproducing audio, e.g., cassette tapes. Here, a listener can only browse the audio with respect to the underlying time-line -- by rewinding or forwarding the tape. The passive nature of listening prohibits multiple views -- it is impossible to first obtain a high-level view and then "look" at portions of the information in detail.
Traditionally, documents have been made available in audio by trained readers speaking the contents onto a cassette tape to produce "talking books." Being non-interactive, these do not permit browsing. They do have the advantage that the reader can interpret the information and convey a particular view of the structure to the listener. However, the listener is restricted to the single view present on the tape. In the early 1980's, text-to-speech technology was combined with OCR (Optical Character Recognition) to produce "reading machines." In addition to being non-interactive, renderings produced from scanning visually formatted text convey very little structure. Thus, the true audio document was non-existent when we started our work.
We overcome these problems of oral communication by developing the notion of audio formatting -- and a computing system that implements it. Audio formatting renders information structure orally, using speech augmented by non-speech sound cues. The renderings produced by this process are attuned to an auditory display -- audio layout present in the output conveys information structure. Multiple audio views are enabled by making the renderings interactive. A listener can change how specific information structures are rendered and browse them selectively. Thus, the listener becomes an active participant in oral communication.
In the past, information was available only in a visual form, and it required a human to recreate its inherent structure. Electronic information has opened a new world: information can now be captured in a display-independent manner -- using, e.g., tools like SGML and LaTeX (1). Though the principal mode of display is still visual, we can now produce alternative renderings, such as oral and tactile displays. We take advantage of this to audio-format information structure present in LaTeX documents. The resulting audio documents achieve effective oral communication of structured information from a wide range of sources, including literary texts and highly technical documents containing complex mathematics.
The results of this thesis are equally applicable to producing audio renderings of structured information from such diverse sources as information databases and electronic libraries. Audio formatting clients can be developed to allow seamless access to a variety of electronic information, available on both local and remote servers. Thus, the server provides the information, and various clients, such as visual or audio formatters, provide appropriate views of the information. Our work is therefore significant in the area of developing adaptive computer technologies.
Today's computer interfaces are like the silent movies of the past! As speech becomes a more integral part of human-computer interaction, our work will become more relevant in the general area of user-interface design, by adding audio as a new dimension to computer interfaces.
AsTeR (2) is a computing system for producing audio renderings of electronic documents. The present implementation works with documents written in the TeX family of markup (3) languages, i.e., TeX, LaTeX and AMSTeX. But the design of AsTeR is not restricted to any single markup language. Though motivated by the need to render technical documents, our system works equally well on structured documents from non-technical subjects.
AsTeR is founded on the belief that all information is display-independent. Information has structure, and this structure is rendered on paper or on a visual display, but the information itself is not restricted to these output modes. Thus, AsTeR renders this same information in audio. AsTeR recognizes the logical structure of a document as embodied in the markup source and represents this structure internally. The internal representation is then rendered in audio by applying a collection of rendering rules written in AFL, a language for audio formatting. Think of AFL as a high-level audio analogue to a visual rendering language like Postscript. Rendering an internalized high-level representation enables AsTeR to produce different audio views of the information. A user can either listen to entire documents, or browse the internal structure and selectively read portions of a document. The rendering and browsing components of AsTeR can work equally well with high-level representations we may get from sources such as OCR-based document recognition.
This article gives a high-level view of how the various components of AsTeR are used. AsTeR is implemented in CLOS (4) with an Emacs front-end. The recommended way of using the system is to run Lisp as a subprocess of Emacs. Throughout this article, we will assume familiarity with basic Emacs concepts. Section 3 introduces the system by showing how simple documents can be read and browsed. Section 4 explains how AsTeR can be extended to read newly defined document structures in (La)TeX (5). Section 5 gives some examples of changing between different ways of rendering the same information. Section 6 presents some advanced techniques that can be used to advantage when reading complex documents such as text books. AsTeR can render information produced by various sources. We give an example of this by demonstrating how AsTeR can be used to interact with the Emacs calculator, a full-fledged symbolic algebra system.
This section assumes that AsTeR has been installed and initialized. At this point, text within any file being visited in Emacs (in general, text in any Emacs buffer) can be rendered in audio. To listen to a piece of text, mark it using standard Emacs commands and invoke read-aloud-region (6). This results in the marked text being audio formatted using a standard rendering style. The text can constitute an entire document or book; it could also be a short paragraph or a single equation from a document. AsTeR renders both partial and complete documents.
This is the simplest and also the most common type of interaction with AsTeR. All markup commands appearing in the text are recognized to produce audio renderings that reflect the structure represented by the markup. The input may be plain ASCII text; in this case, AsTeR will still recognize the minimal document structure present, i.e., paragraph breaks, quoted text, etc. (La)TeX markup helps the system recognize more of the document's logical structure and, as a consequence, produce more sophisticated renderings.
Next to getting the system to speak, the most important thing is to get it to stop speaking. Once an audio rendering has been launched, rendering can be interrupted at any time by executing reader-quit-reading (7). The listener can then traverse the internal structure by moving the current selection, which represents the current position in the document, by executing any of the browser commands reader-move-previous, reader-move-next, reader-move-up or reader-move-down.
To orient the user within the document structure, the current
selection is summarized by verbalizing a short message. For
example, moving the current selection to the left-hand side of
the equation
ABC = 0
produces the message "left hand side is a product". The user has
the option of either listening to just the current selection, or
reading the rest of the document. In the interest of brevity, we
will not give all of the browser key-bindings.
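The browsing model described above can be pictured as a cursor
moving over a tree. The following Python sketch is only a toy
model of that idea, not AsTeR's implementation: the names Node
and Browser, the role field, and the exact wording produced by
summary are assumptions made for illustration. The four move_*
methods mirror the reader-move-* browser commands.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; the tree has parent/child cycles
class Node:
    """One object in a toy document tree (hypothetical, not AsTeR's classes)."""
    kind: str                       # e.g. "equation", "product"
    role: str = ""                  # position within the parent
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


class Browser:
    """Tracks the current selection and moves it through the tree."""

    def __init__(self, root: Node) -> None:
        self.selection = root

    def _sibling(self, offset: int) -> None:
        parent = self.selection.parent
        if parent is None:
            return                  # the root has no siblings
        i = parent.children.index(self.selection) + offset
        if 0 <= i < len(parent.children):
            self.selection = parent.children[i]

    def move_up(self) -> None:
        if self.selection.parent is not None:
            self.selection = self.selection.parent

    def move_down(self) -> None:
        if self.selection.children:
            self.selection = self.selection.children[0]

    def move_next(self) -> None:
        self._sibling(+1)

    def move_previous(self) -> None:
        self._sibling(-1)

    def summary(self) -> str:
        """Short orientation message for the current selection."""
        node = self.selection
        article = "an" if node.kind[0] in "aeiou" else "a"
        where = node.role or "current selection"
        return f"{where} is {article} {node.kind}"


# The equation ABC = 0: a product on the left, a number on the right.
equation = Node("equation")
lhs = equation.add(Node("product", role="left hand side"))
rhs = equation.add(Node("number", role="right hand side"))

browser = Browser(equation)
browser.move_down()
print(browser.summary())            # left hand side is a product
```

Moves that would leave the tree are ignored in this sketch, so
the selection always refers to a valid object.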
3.2 EXAMPLES OF USE
AsTeR can be used:
- To read technical articles and books: The files for such
documents may be available on the local system or on the global
Internet (8). Resources retrieved over the network can be audio
formatted by AsTeR since they are just text in Emacs buffers.
Currently, the system audio formats 10 text books available to
the author on his local system. In addition, AsTeR also renders a
wide collection of technical documents available on the Internet
including technical reports and AMS bulletins.
- For entertainment: At present about 200 electronic texts are
available on the Internet, in addition to the complete works of
Shakespeare. The majority of these documents are in plain ASCII,
but the quality of audio renderings produced by AsTeR based on
the minimal document structure that can be recognized still
surpasses conventional reading machines. Increased availability
of electronic texts marked up in (La)TeX, SGML and HTML will
enable better recognition of document structure, and as a
consequence, better audio renderings.
- In proof-reading: This feature is especially useful when
typesetting complex mathematical formulae. AsTeR can render both
partial and complete documents. Thus, although designed as a
system for reading documents, the flexible design, combined with
the power afforded by the Emacs editor, turns AsTeR into a very
useful document preparation aid.
4. EXTENDING ASTER
As explained in the previous section, the quality of audio
renderings produced by AsTeR is dependent on how much of the
document logical structure is recognized. Authors of (La)TeX
documents often use their own macros (9) to encapsulate specific
structures. AsTeR of course does not know of these extensions to
start with. Occurrences of user-defined (La)TeX macros are
initially rendered in a canonical way; typically, the
user-defined macros are read aloud as they appear in the running
text.
Thus, given a document containing
$A \kronecker B$
AsTeR would produce
cap a kronecker cap b
In this case, this canonical rendering is quite acceptable.
In general, how AsTeR renders such user-defined structures is
fully customizable. The first step is to extend the recognizer to
handle the new construct, in this case \kronecker. Here, we give
the reader a brief example of how this mechanism is used in
practice.
The recognizer is extended by calling Lisp macro
define-text-object. In the case of the \kronecker macro, this
call takes the form:
(define-text-object :macro-name "kronecker"
  :number-args 0
  :processing-function kronecker-expand
  :object-name kronecker-product
  :supers (binary-operator)
  :precedence multiplication)
This extends the recognizer to represent instances of macro
"kronecker" as instances of object kronecker-product. The user
can now define any number of ways in which an instance of object
kronecker-product should be rendered.
AFL, our language for audio formatting, is used to define
rendering rules. Here, we give a rendering rule for object
kronecker-product.
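(def-reading-rule (kronecker-product simple)
  "Simple rendering rule for object kronecker-product."
  (read-aloud (first (children kronecker-product)))
  (read-aloud "kronecker product")
  (read-aloud (second (children kronecker-product))))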
which produces
cap a kronecker product cap b
for the input text shown earlier.
Notice, however, that the rendering rule is free to render the
use of the kronecker product in more complex ways; in particular,
the order in which the expression is spoken can be completely
independent of how it appears on paper. Thus, it is
straightforward to write a rendering rule that produces
"The kronecker product of A and B"
AsTeR derives its power from representing document content
internally as objects and by allowing several user-defined
rendering rules for individual object types. Such rendering rules
can cause any number of audio events, ranging from speaking a
simple phrase to playing a digitized sound, when an instance of a
particular object type is rendered. The mechanism for extending
the recognizer affords this same power when rendering user-
defined constructs. Once the recognizer has been extended by an
appropriate call to define-text-object, such constructs can be
handled just as well as any standard (La)TeX construct.
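As an illustration of several rendering rules for one object
type, the following Python sketch models the two renderings of
the kronecker product discussed above. It is not AFL: the class
KroneckerProduct, the RULES table, the reading_rule decorator and
the rule name prefix are hypothetical stand-ins, and each rule
simply returns the phrases it would speak.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class KroneckerProduct:
    """Toy stand-in for the kronecker-product object: two operand names."""
    left: str
    right: str


def speak_operand(name: str) -> str:
    """Speak an operand; capitals get a 'cap' prefix, as in the text."""
    return f"cap {name.lower()}" if name.isupper() else name


# Rules are keyed by (object name, rule name), like (kronecker-product simple).
Rule = Callable[[KroneckerProduct], List[str]]
RULES: Dict[Tuple[str, str], Rule] = {}


def reading_rule(object_name: str, rule_name: str) -> Callable[[Rule], Rule]:
    """Decorator that registers a rendering rule (hypothetical helper)."""
    def register(fn: Rule) -> Rule:
        RULES[(object_name, rule_name)] = fn
        return fn
    return register


@reading_rule("kronecker-product", "simple")
def simple(obj: KroneckerProduct) -> List[str]:
    # Infix order: the same order as on paper.
    return [speak_operand(obj.left), "kronecker product", speak_operand(obj.right)]


@reading_rule("kronecker-product", "prefix")
def prefix(obj: KroneckerProduct) -> List[str]:
    # Spoken order differs from written order: the operator comes first.
    return ["The kronecker product of", obj.left, "and", obj.right]


expr = KroneckerProduct("A", "B")
for name in ("simple", "prefix"):
    print(" ".join(RULES[("kronecker-product", name)](expr)))
```

Both rules receive the same object; only the order and wording of
what is spoken changes.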
5. PRODUCING DIFFERENT RENDERINGS OF THE SAME OBJECT
AsTeR can produce more than one kind of rendering for a given
object. When perusing printed information, a reader has the
luxury of viewing a complex piece of mathematics from different
perspectives, and AsTeR provides this same functionality. The
listener can switch between any of several pre-defined renderings
for a given object, or add to these by defining new rendering
rules. Switching between different rendering rules produces
different audio views of a given object.
Activating a rendering rule is the simplest way of changing how a
given object is rendered. Statement
(activate-rule 'object-name 'rule-name)
activates rule rule-name for object object-name.
Suppose we wish to skip all instances of verbatim text in a LaTeX
document. We could define the following quiet rendering rule:
(def-reading-rule (verbatim quiet) nil)
and activate it by executing
(activate-rule 'verbatim 'quiet)
To later hear the verbatim text in a document, rule quiet is
deactivated by executing
(deactivate-rule 'verbatim)
Notice that at any given time, only one rendering rule is active
for any object. Hence, we only need specify the object when
deactivating a rendering rule. AsTeR provides an Emacs interface
to activating and deactivating rendering rules.
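The bookkeeping implied here, at most one active rule per object,
can be sketched in a few lines of Python. This is a model of the
behaviour described above rather than AsTeR's code: the RULES and
ACTIVE tables and the fallback to a rule named default are
assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

# (object name, rule name) -> function returning the text to speak,
# or None to stay silent.
RULES: Dict[Tuple[str, str], Callable[[str], Optional[str]]] = {
    ("verbatim", "default"): lambda text: text,
    ("verbatim", "quiet"): lambda text: None,
}

# At most one explicitly activated rule per object.
ACTIVE: Dict[str, str] = {}


def activate_rule(obj: str, rule: str) -> None:
    if (obj, rule) not in RULES:
        raise KeyError(f"no rule {rule!r} defined for object {obj!r}")
    ACTIVE[obj] = rule              # replaces any rule previously active for obj


def deactivate_rule(obj: str) -> None:
    # Only the object is needed, because it has at most one active rule.
    ACTIVE.pop(obj, None)


def render(obj: str, text: str) -> Optional[str]:
    rule = ACTIVE.get(obj, "default")   # assumed fallback when nothing is active
    return RULES[(obj, rule)](text)


activate_rule("verbatim", "quiet")
print(render("verbatim", "ls -l"))      # None: verbatim text is skipped
deactivate_rule("verbatim")
print(render("verbatim", "ls -l"))      # ls -l
```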
Activating a single rendering rule is a convenient way of
changing how a specific object is rendered. Rendering styles
allow making more global changes to the renderings. Activating
style style-1 by executing
(activate-style 'style-1)
makes the rendering rule named style-1 active for all objects for
which this rendering rule is defined. All other objects continue
to be rendered as before. This is also true when a sequence of
rendering styles is successively activated.
Thus, activating rendering styles is a convenient way of
progressively customizing the rendering of a complex document.
The effect of activating a style can be undone at any time by
executing
(deactivate-style 'style-1)
AsTeR provides the following rendering styles:
- Variable-substitution: Use variable substitution when rendering
complex mathematical expressions.
- Use-special-pattern: Recognize special patterns in mathematical
expressions to produce context-specific renderings.
- Descriptive: Produce descriptive, context-specific renderings
for mathematical expressions.
- Simple: Produce a base-level audio notation for mathematical
expressions.
- Default: Produce default renderings.
- Summarize: Provide a short summary.
- Quiet: Skip objects.
When AsTeR is initialized, the following styles are active:
(use-special-pattern descriptive simple default)
with the leftmost style the most recently activated style.
Defining a new rendering style amounts to defining a collection
of rendering rules having the same name. Note that a rendering
style need not provide rendering rules for all objects in the
document logical structure. As explained earlier, activating a
rendering style only affects the renderings of those objects for
which the style provides a rule.
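One way to picture how a list of active styles selects a rule is
the Python sketch below. The lookup order (search the active
styles from the most recently activated, take the first one that
provides a rule for the object) is an assumption consistent with
the description above, and the STYLE_RULES data is invented for
illustration.

```python
from typing import Dict, List, Set

# Which objects each style provides a rule for (illustrative data only).
STYLE_RULES: Dict[str, Set[str]] = {
    "default": {"paragraph", "fraction", "verbatim"},
    "simple": {"fraction"},
    "descriptive": {"fraction"},
    "use-special-pattern": {"summation"},
    "quiet": {"verbatim"},
}

# Most recently activated style first, as in the initial list in the text.
active_styles: List[str] = ["use-special-pattern", "descriptive", "simple", "default"]


def activate_style(style: str) -> None:
    if style in active_styles:
        active_styles.remove(style)
    active_styles.insert(0, style)


def deactivate_style(style: str) -> None:
    if style in active_styles:
        active_styles.remove(style)


def rule_for(obj: str) -> str:
    """Return the first active style that provides a rule for obj."""
    for style in active_styles:
        if obj in STYLE_RULES.get(style, set()):
            return style
    raise LookupError(f"no active style renders {obj!r}")


activate_style("quiet")
print(rule_for("verbatim"))   # quiet
print(rule_for("fraction"))   # descriptive
print(rule_for("paragraph"))  # default
deactivate_style("quiet")
print(rule_for("verbatim"))   # default
```

Activating quiet changes only the objects for which a quiet rule
exists; fractions and paragraphs are rendered as before.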
6. USING THE FULL POWER OF AsTeR
This section demonstrates some advanced features of AsTeR that
are useful when rendering complex documents. AsTeR recognizes
cross-references and allows the listener to traverse these as
hypertext links. Cross-referenceable objects can be labelled
interactively and these labels used when referring to such
objects within renderings. The ability to switch between
rendering rules allows the listener to quickly locate portions of
interest in a document. By activating rendering rules, all
instances of a particular object can be floated to the end of the
containing hierarchical unit, or entirely skipped. This is
convenient when getting a quick overview of a document. AsTeR
also provides a simple bookmark facility for marking positions of
interest to be returned to later. Finally, AsTeR can be
interfaced with sources of structured information other than
electronic documents. We demonstrate this by interfacing AsTeR to
the Emacs calculator.
6.1 Cross-References
Cross-reference tags occurring in the body of a document are
represented internally as instances of object cross-reference and
contain a link to the object being referenced. How such cross-
reference tags are rendered of course depends on the currently
active rule for object cross-reference. The default rendering
rule for cross-references presents the user with a summary of the
object being cross-referenced, e.g., the number and title of a
sectional unit. This is followed by a non-speech audio prompt.
Pressing a key at this prompt results in the entire
cross-referenced object being rendered at this point. Reading
continues if no key is pressed within a certain time interval. In
addition, the listener can interrupt the rendering and move
through the cross-reference tags. This is useful in cases where
many such tags occur within the same sentence.
6.2 Labelling a cross-referenceable object
Consider a proof that reads:
By theorem 2.1 and lemma 3.5 we get equation 8 and
hence the result.
If the above looks abstruse in print, it sounds meaningless in
audio. This is in fact a serious drawback when listening to
mathematical books on cassette where it is practically impossible
to locate the cross-reference. AsTeR is more effective since
these cross-reference links can be traversed; but traversing each
link while listening to the above proof can be distracting.
Typically, we only glance back at the cross-references to get
sufficient information about what theorem 2.1 is about. AsTeR
provides a convenient mechanism for building in such information
into the renderings. When a cross-referenceable object such as an
equation is rendered, the system verbalizes an automatically
generated label, i.e., the equation number, and then generates an
audible prompt. If the user presses a key at this prompt, he can
specify a more meaningful label which will be used in preference
to the system-generated label when rendering cross-reference
tags.
To continue the current example, when listening to theorem 2.1,
the user could have specified the label "Fermat's theorem". Then
the proof shown earlier would be read as:
By Fermat's theorem and lemma 3.5 we get equation 8
and hence the result.
Of course, the user could have specified labels for the other
cross-referenced objects as well, in which case the rendering
produced almost obviates the need to look back at the cross-
referenced objects.
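The preference for listener-supplied labels amounts to a lookup
with a fallback. The Python sketch below models it; the
(kind, number) keys and the function names are hypothetical, and
cross-reference tags are represented simply as tuples inside a
list of words.

```python
from typing import Dict, List, Tuple

# A cross-referenced object is identified here by (kind, number).
Ref = Tuple[str, str]

user_labels: Dict[Ref, str] = {}


def system_label(ref: Ref) -> str:
    """Automatically generated label, e.g. 'theorem 2.1'."""
    kind, number = ref
    return f"{kind} {number}"


def spoken_label(ref: Ref) -> str:
    """Prefer a label supplied by the listener over the generated one."""
    return user_labels.get(ref, system_label(ref))


def render_proof(parts: List[object]) -> str:
    """Join plain strings and cross-reference tuples into one utterance."""
    words = [spoken_label(p) if isinstance(p, tuple) else str(p) for p in parts]
    return " ".join(words)


proof = ["By", ("theorem", "2.1"), "and", ("lemma", "3.5"),
         "we get", ("equation", "8"), "and hence the result."]

print(render_proof(proof))
user_labels[("theorem", "2.1")] = "Fermat's theorem"
print(render_proof(proof))
```

The second call speaks "Fermat's theorem" in place of
"theorem 2.1"; the unlabelled references keep their generated
labels.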
6.3 Locating portions of interest
Printed books allow the reader to skim through the text and
quickly locate portions of interest. Experienced readers use
several different techniques to achieve this. One of these is to
locate an equation or table of interest, and then read the text
surrounding this object. AsTeR provides this functionality to
some extent.
We explained in Section 5 that different rules can be activated
to change the type of renderings produced. Using this mechanism,
we can activate a rendering rule that only reads the equations
occurring in a document. Once an equation of interest is located,
rendering can be interrupted and the rendering rule changed.
Using the browser, the listener can now move the current
selection to the enclosing hierarchical unit and then read the
surrounding text.
6.4 Getting an overview of a document
Rendering rules can be activated to obtain different views of a
document. For instance, activating rendering rule quiet for
object paragraph provides a thumb-nail view of a document.
Activating rendering rule quiet is a convenient way of
temporarily skipping over all occurrences of a specific object.
We often do this when perusing printed documents; we skip over
complex material at the first reading and return to these later.
We may skip instances of some objects entirely e.g., source code;
in other cases we may merely defer the reading. This notion of
delaying the reading of an object is aptly captured by the
concept of floating an object to the end of the enclosing unit.
Typesetting systems like (La)TeX permit the author to float all
figures and tables to the end of the containing section or
chapter. However, only specific objects can be floated, and this
is exclusively under the control of the author, not the consumer
of the document.
AsTeR provides a much more general framework for floating
objects. Any object can be floated to the end of any enclosing
hierarchical unit, e.g., instances of object footnote can be
floated to the end of the containing paragraph. The ability to
float objects is very useful when producing audio renderings.
This is because audio takes time, and it is advantageous to delay
the rendering of some objects when obtaining an overview. Printed
documents use footnotes and floating figures for precisely this
reason. The interactive nature of AsTeR allows us to extend this
functionality.
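Floating can be modelled as deferring an object until its target
unit closes. The Python sketch below shows one way to do this on
a toy tree of (kind, body) pairs; the data layout, the float_to
mapping and the sample sentences are assumptions for
illustration, and only leaf objects are floated.

```python
from typing import Dict, List, Tuple

# A leaf is (kind, text); a unit is (kind, [children]).
Node = Tuple[str, object]
Pending = List[Tuple[str, str]]     # (target unit kind, deferred text)


def render(node: Node, float_to: Dict[str, str]) -> Tuple[List[str], Pending]:
    """Return (spoken items in order, items still waiting for an outer unit)."""
    kind, body = node
    if isinstance(body, str):
        target = float_to.get(kind)
        if target is not None:
            return [], [(target, body)]     # defer until `target` closes
        return [body], []

    spoken: List[str] = []
    pending: Pending = []
    for child in body:
        child_spoken, child_pending = render(child, float_to)
        spoken += child_spoken
        pending += child_pending

    # This unit is closing: speak everything that was floated to it.
    spoken += [text for target, text in pending if target == kind]
    pending = [(target, text) for target, text in pending if target != kind]
    return spoken, pending


section = ("section", [
    ("paragraph", [
        ("text", "Audio takes time."),
        ("footnote", "Unlike a glance at a page."),
        ("text", "So some objects are deferred."),
    ]),
    ("paragraph", [("text", "Next paragraph.")]),
])

spoken, leftover = render(section, {"footnote": "paragraph"})
print(spoken)
```

Changing the mapping to {"footnote": "section"} moves the
footnote after "Next paragraph." instead. Items whose target unit
never occurs are returned in leftover rather than spoken.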
6.5 Bookmarks
The browser provides a simple bookmark facility for marking
positions of interest to be returned to later. Browser command
mark-read-pointer, bound to C-b m, prompts for a bookmark name and
marks the current selection. The listener can later read the
object at this marked position, or move the current selection to
the marked position by executing browser command follow-bookmark
and specifying the appropriate bookmark name.
6.6 Reading using variable substitution
When reading complex mathematics in print, we often get a high-
level view of an equation first, and read the leaves of an
expression once we have understood the top-level structure. Thus,
when presented with a complex equation, an experienced reader of
mathematics might view it as an equation with a double summation
on the left-hand-side and a double integral on the right-hand-
side, and only then attempt to read the equation in full detail.
In an audio rendering that simply produces a linear rendering,
the temporal nature of audio prevents a listener from getting
such high-level views. We compensate by providing a variable
substitution rendering style. When active, this results in AsTeR
replacing sub-expressions in complex mathematics with meaningful
phrases. Having thus provided a top-level view, AsTeR then reads
the sub-expressions that were substituted for earlier upon
request.
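A much simplified model of this two-pass reading is sketched
below in Python. It departs from the description above in one
respect, stated plainly: instead of meaningful phrases, it
substitutes numbered placeholders ("x 1", "x 2"). The tuple
representation of expressions, the leaf-count threshold limit and
all function names are assumptions.

```python
from typing import Dict, List, Tuple, Union

Expr = Union[str, Tuple]   # a leaf symbol, or (operator, operand, operand, ...)


def size(e: Expr) -> int:
    """Number of leaves in the expression."""
    return 1 if isinstance(e, str) else sum(size(arg) for arg in e[1:])


def speak(e: Expr) -> str:
    """Plain linear rendering: infix for binary operators, prefix otherwise."""
    if isinstance(e, str):
        return e
    op, *args = e
    if len(args) == 2:
        return f"{speak(args[0])} {op} {speak(args[1])}"
    return f"{op} of " + " and ".join(speak(a) for a in args)


def substitute(e: Expr, limit: int, table: Dict[str, Expr]) -> Expr:
    """Replace operands with more than `limit` leaves by fresh names in `table`."""
    if isinstance(e, str):
        return e
    op, *args = e
    new_args: List[Expr] = []
    for arg in args:
        if size(arg) > limit:
            name = f"x {len(table) + 1}"
            table[name] = arg           # remembered so it can be read later
            new_args.append(name)
        else:
            new_args.append(arg)
    return (op, *new_args)


expr = ("equals",
        ("plus", ("times", "a", "b"), ("times", "c", "d")),
        ("over", ("plus", "e", "f"), "g"))

table: Dict[str, Expr] = {}
top = substitute(expr, limit=2, table=table)
print(speak(top))                       # x 1 equals x 2
for name, sub in table.items():         # details, read on request
    print(f"{name} is {speak(sub)}")
```

The listener first hears the top-level shape of the equation and
can then ask for each substituted sub-expression in turn. This
sketch substitutes at one level only.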
6.7 Interfacing AsTeR with other information sources
AsTeR has been presented as a system for reading documents. More
generally, AsTeR is a system for presenting structured
information in audio. This fact is amply demonstrated by the
following example where we interface AsTeR to the Emacs
calculator, a full-fledged symbolic algebra system.
The Emacs calculator is a public domain symbolic algebra system
available under the terms of the GNU license. It provided an
excellent source of examples for trying out the variable
substitution rendering style for mathematical expressions.
Providing an audio interface to a symbolic algebra system is
challenging since the expressions produced are quite complex. The
flexible design of AsTeR and the power of Emacs make this
interface easy. AsTeR can render any information present in an
Emacs buffer. The output of the Emacs calculator satisfies this
requirement. A collection of Emacs Lisp functions arranges for
the output from the calculator to be sent to AsTeR.
A user of the Emacs calculator can now perform a computation and
execute command read-previous-calc-answer to have the output
rendered by AsTeR. The expression can be browsed, summarized,
transformed by applying variable substitution, and the rendering
manipulated in any of the ways described so far in the context of
documents.
NOTES
(1) Standard Generalized Markup Language (SGML) captures
information in a layout independent form; LaTeX, designed by
Leslie Lamport, is a document preparation system based on the
TeX typesetting system developed by Donald Knuth.
(2) In real life, AsTeR is the name of the author's guide-dog, a
big friendly black Labrador.
(3) To most people, "markup" means an increase in the price of an
article. Here, "markup" is a term from the publishing and
printing business, where it means the instructions for the
typesetter, written on a typescript or manuscript copy by an
editor. Typesetting systems like LaTeX have these commands
embedded in the electronic source. A markup language is a set of
means (constructs) to express how text (i.e., that which is not
markup) should be processed, or handled in other ways.
(4) CLOS (Common Lisp Object System) is an object oriented
extension of Common Lisp.
(5) In this article, the notation (La)TeX represents the entire
"family" of markup languages including TeX, LaTeX, and AMSTeX.
(6) This is an Emacs Lisp command, and in the author's setup, it
is bound to C-z d.
(7) reader-quit-reading is bound to C-b q.
(8) ANGE-FTP, an Emacs utility written by Andy Norman, allows
seamless access to such files. In addition, Emacs clients are
available for networked information retrieval systems like
GOPHER, WWW and WAIS.
(9) Macros permit an author to define new language constructs in
TeX and specify how these constructs should be rendered on paper.