From Dirt to Shovels:
Fully Automatic Tool Generation from Ad Hoc Data
Kathleen Fisher
AT&T Labs Research
HEC 101 11:00 AM
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily
available. Such data must be queried, transformed, and displayed by
systems administrators, computational biologists, financial analysts
and hosts of others on a regular basis. In this talk, we demonstrate
that it is possible to generate a suite of useful data processing
tools, including a semistructured query engine, several format
converters, a statistical analyzer, and data visualization routines,
directly from the ad hoc data itself, without any human intervention.
The key technical contribution of the work is a multi-phase algorithm
that automatically infers the structure of an ad hoc data source and
produces a format specification in the PADS data description
language. Programmers wishing to implement custom data analysis tools
can use such descriptions to generate printing and parsing libraries
for the data. Alternatively, our software infrastructure will push
these descriptions through the PADS compiler, creating
format-dependent modules that, when linked with format-independent
algorithms for analysis and transformation, result in fully functional
tools. We evaluate the performance of our inference algorithm,
showing that it scales linearly in the size of the training data and
completes in seconds, as opposed to the hours or days it takes to
write a description by hand. We also evaluate the correctness of the
algorithm, demonstrating that generating accurate descriptions often
requires less than 5% of the available data.
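
To give a rough feel for what inferring structure from a small sample can mean, here is a minimal Python sketch. It is not the PADS inference algorithm and does not produce PADS syntax. The token classes, the function names (signature, infer_record_shape, coverage), and the sample log lines are all assumptions made for illustration. It tokenizes each line, takes the most common token-class sequence in a small training sample as the inferred record shape, and then measures how many lines of the full data match that shape.

```python
import re
from collections import Counter

# Hypothetical token classes, chosen for illustration only.
TOKEN_PATTERNS = [
    ("INT", r"\d+"),
    ("WORD", r"[A-Za-z_]+"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"[^\w\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def signature(line):
    """Map a line to a tuple of token classes; punctuation is kept literally."""
    sig = []
    for match in TOKEN_RE.finditer(line):
        kind = match.lastgroup
        if kind == "SPACE":
            continue  # whitespace only separates tokens in this toy model
        sig.append(match.group() if kind == "PUNCT" else kind)
    return tuple(sig)


def infer_record_shape(sample_lines):
    """Return the most common signature in the training sample."""
    if not sample_lines:
        raise ValueError("need at least one sample line")
    counts = Counter(signature(line) for line in sample_lines)
    return counts.most_common(1)[0][0]


def coverage(shape, lines):
    """Fraction of lines whose signature equals the inferred shape."""
    if not lines:
        raise ValueError("need at least one line")
    return sum(signature(line) == shape for line in lines) / len(lines)


# Made-up data: train on the first two lines, evaluate on all four.
data = [
    "10.0.0.1 GET 200",
    "10.0.0.2 POST 404",
    "# comment",
    "192.168.1.5 GET 500",
]
shape = infer_record_shape(data[:2])
print(shape)
print(coverage(shape, data))
```

A real inference system would need to go well beyond this sketch, for example by handling alternative record shapes, nested structure, and semantic constraints; the sketch only illustrates the idea of learning a format from a fraction of the data and checking it against the rest.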
More information about the PADS project is available on the web:
http://www.padsproj.org.
This is joint work with David Walker and Kenny Zhu of Princeton
University and David Burke and Peter White of Galois.
BIO:
Kathleen Fisher actively contributes to the field of programming languages, publishing papers in PLDI, POPL, ICFP, ECOOP, DSL, KDD, and TOPLAS. Her early work on the foundations of object-oriented languages led to the design of the class mechanism in Moby. The main thrust of her recent work has been in domain-specific languages to facilitate programming with massive amounts of ad hoc data. In particular, Kathleen initiated and leads the PADS project. PADS is a system that allows data analysts to write declarative descriptions of ad hoc data, including both physical layout information and semantic constraints. From such descriptions, the PADS system generates tools and applications for manipulating the data. Kathleen is Chair of SIGPLAN, on the steering committee of CRA-W, and an editor of the Journal of Functional Programming. She has served as program chair for FOOL, ICFP, and CUFP.