SIMA: Systems Integration for Manufacturing Applications (NIST, National Institute of Standards and Technology)
Technical Research Projects
Developing, Searching and Exploiting Semantic Mathematical Content
Principal Investigators:

Abdou Youssef
(301) 975-5067
abdou.youssef@nist.gov

Bruce Miller
(301) 975-2708
bruce.miller@nist.gov

Objective:
To continue the development of LaTeXML, our tool for converting LaTeX documents into XML, with a renewed emphasis on boosting the semantic content in ways that enhance its utility as part of the "Semantic Web" and improve the ability to search for mathematical information within the documents. Coupled with this will be continued development of a mathematical search engine that leverages this content, as well as research into how best to rank and display search results.

Background:
Traditionally, LaTeX markup serves primarily for math formatting (i.e., presentation) rather than for semantic encoding of content. Our preferred strategy, adopted for the DLMF project, improves the situation by encouraging more semantic markup. Completely semantic markup becomes unwieldy, however, and does not address the needs of legacy documents. We are therefore researching declarative, heuristic and type-analysis methods to infer the intended content. We are assessing a method that assigns ambiguous semantics to mathematical symbols, so that each symbol carries the set of meanings it could take. This may give rise to refinement techniques that prune inconsistent interpretations or select the best one. Even where it does not, the information becomes richer and more amenable to search. Formalizing ambiguous symbols, like those found in OpenMath (or MathML3) content dictionaries, may provide a useful ontology for further work. Additionally, flexible parsers that can produce multiple parse trees may prove useful. Partitioning the processing into stages that generate plausible parse trees while allowing successive semantic refinements provides a method for generating representations that range from the purely presentational to those that progressively approach the mathematical meaning. Thus, LaTeX-based projects like the DLMF can produce deeper content, where the extra markup effort is repaid, while analysis of legacy material, in the arXiv for example, can still obtain useful, but less exact, content.
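To make the idea of ambiguous semantics with refinement concrete, the following is a minimal sketch, not the method used in LaTeXML. It treats the expression a(x), where the juxtaposition could mean function application or multiplication. The candidate table, the type rule, and the names CANDIDATES, REQUIRED_LEFT_TYPE, consistent_interpretations and refine are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical candidate meanings for each symbol in "a(x)". A real system
# might draw such sets from OpenMath or MathML3 content dictionaries.
CANDIDATES = {
    "a": {"function", "number"},
    "juxtaposition": {"apply", "times"},
    "x": {"number"},
}

# Hypothetical type rule: the type each reading of the juxtaposition
# requires of its left operand.
REQUIRED_LEFT_TYPE = {"apply": "function", "times": "number"}


def consistent_interpretations(candidates):
    """Enumerate joint assignments of meanings and keep those that type-check."""
    symbols = sorted(candidates)
    results = []
    for choice in product(*(sorted(candidates[s]) for s in symbols)):
        assignment = dict(zip(symbols, choice))
        if REQUIRED_LEFT_TYPE[assignment["juxtaposition"]] == assignment["a"]:
            results.append(assignment)
    return results


def refine(candidates, symbol, meaning):
    """Return a copy in which one symbol is narrowed to a single meaning,
    as an author declaration or a heuristic might do."""
    narrowed = dict(candidates)
    narrowed[symbol] = candidates[symbol] & {meaning}
    return narrowed


ambiguous = consistent_interpretations(CANDIDATES)
declared = consistent_interpretations(refine(CANDIDATES, "a", "function"))
```

Without further information, two readings survive the type check (a function applied to x, or a number times x). Once "a" is declared to be a function, only the application reading remains. Both results are usable for search; the second is simply more exact.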

We will carry out research into robust search indexing that can work with math at these various stages of refinement. We have recently shifted the emphasis from indexing the original LaTeX representation of the mathematics towards indexing the generated presentation MathML. In so doing, we will create a more reusable module, since indexing and search of MathML will have broader application than indexing and search of LaTeX alone; further work will be performed to abstract out a generic math search module, separate from any DLMF-specific functionality. Additionally, because the MathML itself is indexed, the indexed tokens are keyed to the markup used to display summaries of hits in the search results. This allows fine-grained highlighting of the specific math symbols and structures that match the search query, giving a hit list that is easier to assess. Preliminary results for the highlighting are promising, but refinements are needed and will be carried out to improve the selectivity of the terms to highlight.
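The link between indexed tokens and display markup can be sketched as follows. This is an illustrative toy under stated assumptions, not the DLMF search engine: the functions index_tokens and highlight, the position-based key, the sample formulas, and the use of a class="hit" attribute to mark matches are all hypothetical. Only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Presentation MathML token elements: identifiers, numbers, operators.
TOKEN_TAGS = {"mi", "mn", "mo"}


def local_name(tag):
    """Drop an XML namespace prefix of the form '{uri}name', if present."""
    return tag.rsplit("}", 1)[-1]


def index_tokens(docs):
    """Build an inverted index: token text -> [(doc_id, element position)].

    The position is the element's rank in document order, so each index
    entry points back at the exact markup element it came from.
    """
    trees = {}
    index = defaultdict(list)
    for doc_id, source in docs.items():
        root = ET.fromstring(source)
        trees[doc_id] = root
        for position, element in enumerate(root.iter()):
            if local_name(element.tag) in TOKEN_TAGS and element.text:
                index[element.text.strip()].append((doc_id, position))
    return trees, index


def highlight(trees, index, token):
    """Mark every element matching the token; return the matching doc ids."""
    by_doc = defaultdict(set)
    for doc_id, position in index.get(token, []):
        by_doc[doc_id].add(position)
    for doc_id, positions in by_doc.items():
        for position, element in enumerate(trees[doc_id].iter()):
            if position in positions:
                element.set("class", "hit")  # hypothetical highlight marker
    return sorted(by_doc)


# Hypothetical sample formulas: x + 1, x squared, and y.
docs = {
    "eq1": "<math><mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow></math>",
    "eq2": "<math><msup><mi>x</mi><mn>2</mn></msup></math>",
    "eq3": "<math><mi>y</mi></math>",
}
trees, index = index_tokens(docs)
matching_docs = highlight(trees, index, "x")
```

Because each index entry records where the token sits in the MathML tree, the same lookup that finds a hit also identifies the element to highlight in the displayed summary. Indexing the LaTeX source instead would require a separate mapping from source tokens to rendered markup.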

One other search-related research problem that will be addressed is improved stemming that is more appropriate for math search. Stemming is the process of stripping suffixes from words, resulting in roots that may or may not be linguistically valid. For example, the words “generate”, “generates”, “generating”, “generated”, and “generation” all stem to “generat”. The stemming process is usually applied to the content before indexing, and to the query keywords before searching, so that a query like “generate” would match a document that contains “generated”. This feature has worked very well in text search, including Google. In math search, however, such stemming reduces precision: for instance, a user searching for “generating” (as in “generating functions”) gets irrelevant documents about “generation of random numbers”. We have begun developing a new stemmer that stems words to other linguistically valid words and that preserves mathematically relevant nuances. In this project, we will complete this stemmer and test its effectiveness in the context of math search.
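The precision problem can be illustrated with a small sketch. The suffix list, the PROTECTED set, the VALID_ROOTS table, and both function names are assumptions chosen to reproduce the example in the text; they do not describe the stemmer under development in this project.

```python
# Illustrative suffix list, tried longest first. This is an assumption,
# not the rule set of any particular published stemmer.
SUFFIXES = ("ing", "ion", "ed", "es", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical set of word forms whose distinction matters in mathematics.
PROTECTED = {"generating", "generation"}

# Hypothetical map from a stripped stub to a linguistically valid word.
VALID_ROOTS = {"generat": "generate"}


def math_aware_stem(word):
    """Leave protected forms alone; map other forms to a valid word."""
    if word in PROTECTED:
        return word
    # Fall back to the unstemmed word when no valid root is known.
    return VALID_ROOTS.get(naive_stem(word), word)


words = ["generate", "generates", "generating", "generated", "generation"]
naive = {w: naive_stem(w) for w in words}
aware = {w: math_aware_stem(w) for w in words}
```

Under the naive rule all five words collapse to "generat", so a query for "generating" also matches "generation". Under the second rule, "generate", "generates" and "generated" still share the valid root "generate", while "generating" and "generation" stay distinct, so a query about generating functions no longer retrieves documents about random number generation.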

The findings and techniques resulting from this project are being applied to the DLMF in the immediate term, and will form the basis for a potential new project: building a Web-wide math search system.

 

 

Page created October 2008
