Developing, Searching and Exploiting Semantic
Mathematical Content
Principal Investigators:
Abdou Youssef, (301) 975-5067, abdou.youssef@nist.gov
Bruce Miller, (301) 975-2708, bruce.miller@nist.gov
Objective:
To continue the development of LaTeXML, our
tool for converting LaTeX documents into XML, with a renewed emphasis on
boosting the semantic content in ways that enhance its utility as part of the
`Semantic Web', as well as the ability to search for mathematical information
within the documents. Coupled with this will be continued development of a
mathematical search engine leveraging this content, as well as research in the
best ranking and display of search results.
Background:
Traditionally, LaTeX markup is primarily for math formatting (i.e.,
presentation) rather than semantic encoding of content. Our preferred strategy,
adopted for the DLMF project, improves the situation by encouraging more
semantic markup. Completely semantic markup becomes unwieldy, however, and does
not address the needs of legacy documents. We are therefore researching
declarative, heuristic and type-analysis methods to infer the intended content.
We are assessing a method of assigning ambiguous semantics to mathematical
symbols to represent the set of possible meanings each symbol can take. This may
give rise to refinement techniques that prune inconsistent interpretations or
select the best one. Even where it does not, the information becomes richer and more
amenable to search. Formalizing ambiguous symbols such as those found in OpenMath
(or MathML3) content dictionaries may provide a useful ontology for further
work. Additionally, flexible parsers that can produce multiple parse trees may
prove useful. Partitioning the processing into stages that generate plausible
parse trees while allowing successive semantic refinements provides a method for
generating representations ranging from the purely presentational to those that
progressively approach the mathematical meaning. Thus, LaTeX-based projects like
the DLMF can produce deeper content, where the extra markup effort is repaid,
while analyzing legacy material in the arXiv, for example, can still obtain
useful, but less exact, content.
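
The idea of assigning each symbol a set of possible meanings and then pruning
inconsistent combinations can be illustrated with a minimal sketch. Everything
in it is an assumption made for illustration: the candidate meanings, the
signature table and the function name `interpretations` are hypothetical, and
it represents neither LaTeXML's implementation nor an actual OpenMath content
dictionary.

```python
from itertools import product

# Hypothetical candidate meanings per symbol (illustrative only).
CANDIDATES = {
    "P": {"legendre_polynomial", "probability"},
    "n": {"integer_variable"},
    "x": {"real_variable", "event"},
}

# Hypothetical typing rules: the argument meanings each head meaning accepts.
SIGNATURES = {
    "legendre_polynomial": ("integer_variable", "real_variable"),
    "probability": ("event",),
}


def interpretations(head, args):
    """Enumerate every meaning assignment and keep those that type-check."""
    symbols = [head, *args]
    pools = [sorted(CANDIDATES[s]) for s in symbols]
    consistent = []
    for choice in product(*pools):
        head_meaning, arg_meanings = choice[0], choice[1:]
        # Prune assignments whose arguments do not fit the head's signature.
        if SIGNATURES.get(head_meaning) == arg_meanings:
            consistent.append(dict(zip(symbols, choice)))
    return consistent


# P_n(x): head P applied to n and x.
print(interpretations("P", ["n", "x"]))
# P(x): head P applied to x alone.
print(interpretations("P", ["x"]))
```

Under these assumed tables, `P` applied to `n` and `x` keeps only the Legendre
polynomial reading, while `P` applied to `x` alone keeps only the probability
reading. When more than one assignment survives, the remaining set is still
narrower than the original one, which matches the observation that the
information becomes richer even without a unique answer.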
Research in robust search indexing that can work with math in these various
stages of refinement will be carried out. We have recently shifted the emphasis
from indexing the original LaTeX representation of the mathematics towards
indexing the generated presentation MathML. In so doing, we will create a more
reusable module, since indexing and search of MathML will have broader
application than of LaTeX alone; further work will be performed to abstract out
a generic math search module, separate from any DLMF-specific functionality.
Additionally, indexing the MathML keys the indexed tokens to the markup
used to display summaries of hits in the search results. This allows for
fine-grained highlighting of the specific math symbols and structures that match
the search query, giving a more easily assessed hit list. Preliminary results
for the highlighting are promising, but refinements are needed and will be
carried out to improve the selectivity of the terms to highlight.
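
How indexing presentation MathML ties tokens to the displayed markup can be
sketched as follows. This is only an illustration under stated assumptions:
the sample formula, the `id` values, the `class="hit"` convention and the
function names `build_index` and `highlight` are hypothetical and do not
describe the DLMF search engine or the ids that LaTeXML generates.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", MATHML_NS)  # serialize without ns0: prefixes

# Hypothetical presentation MathML for J_n(x); the ids are made up.
MATHML = (
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<msub><mi id="m1">J</mi><mi id="m2">n</mi></msub>'
    '<mo id="m3">(</mo><mi id="m4">x</mi><mo id="m5">)</mo>'
    "</math>"
)

TOKEN_TAGS = {"mi", "mn", "mo"}  # MathML token elements indexed here


def build_index(mathml):
    """Map each token's text to the ids of the elements that display it."""
    root = ET.fromstring(mathml)
    index = defaultdict(list)
    for el in root.iter():
        local_name = el.tag.rsplit("}", 1)[-1]  # drop the namespace part
        if local_name in TOKEN_TAGS and el.text and el.get("id"):
            index[el.text.strip()].append(el.get("id"))
    return root, index


def highlight(root, index, query_tokens):
    """Mark the elements matching the query tokens and return the markup."""
    hit_ids = {i for token in query_tokens for i in index.get(token, [])}
    for el in root.iter():
        if el.get("id") in hit_ids:
            el.set("class", "hit")  # a stylesheet could render this class
    return ET.tostring(root, encoding="unicode")


root, index = build_index(MATHML)
print(dict(index))
print(highlight(root, index, ["J"]))
```

Because the index stores element ids rather than positions in the LaTeX
source, a match can be mapped directly back to the element shown in the hit
summary, which is what makes symbol-level highlighting possible.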
One other search-related research problem that will be addressed is improved
stemming that is more appropriate for math search. Stemming is the process of
stripping suffixes from words, resulting in roots that may or may not be
linguistically valid. For example, the words “generate”, “generates”,
“generating”, “generated”, and “generation”, all stem to “generat”. The stemming
process is usually applied on the contents before indexing, and on the query
keywords before searching, so that a query like “generate” would match a
document that contains “generated”. This feature has worked very well in text
search, including Google. In math search, however, such stemming reduces
precision: for instance, a user searching for “generating” (as in
“generating functions”) gets irrelevant documents about “generation of random
numbers”. We have begun developing a new stemmer that stems words to
other linguistically valid words and that preserves mathematically relevant
nuances. In this project, we will complete this stemmer and test its
effectiveness in the context of math search.
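
The contrast between conventional stemming and a stemmer that maps words only
to valid words while preserving mathematically relevant forms can be sketched
as below. Both functions are toy illustrations: `crude_stem` is a simple
suffix stripper rather than any published stemming algorithm, and
`conservative_stem`, with its tiny `VOCAB` and `PROTECTED` sets, is a
hypothetical stand-in, since the rules of the stemmer under development are
not described here.

```python
# Suffixes removed by the toy stripper, checked in this order.
CRUDE_SUFFIXES = ("ing", "ion", "es", "ed", "e", "s")

# Hypothetical vocabulary of valid words and of forms kept distinct
# because they carry a specific mathematical meaning.
VOCAB = {"generate", "function", "index"}
PROTECTED = {"generating", "generation"}

# (suffix, replacement) rules tried in order by the conservative stemmer.
INFLECTION_RULES = (("es", "e"), ("es", ""), ("ed", "e"), ("ed", ""), ("s", ""))


def crude_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in CRUDE_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def conservative_stem(word):
    """Reduce a word only to a known valid word, leaving protected forms alone."""
    if word in PROTECTED or word in VOCAB:
        return word
    for suffix, replacement in INFLECTION_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCAB:
                return candidate
    return word


for w in ("generate", "generates", "generating", "generated", "generation"):
    print(w, crude_stem(w), conservative_stem(w))
```

With these assumed word lists, the crude stripper collapses all five words to
“generat”, whereas the conservative version maps “generates” and “generated”
to “generate” but keeps “generating” and “generation” apart, so a query for
“generating” would no longer match “generation”.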
The findings and techniques resulting from this project are being applied to the
DLMF in the immediate term, and will form the basis for a potential new project:
building a Web-wide math search system.