Workloads generated by astronomy queries over the Internet cannot be cached by existing distributed caching techniques. In particular, both Web caching, in proxies and browsers, and semantic (query) caching cannot address the large data sizes in astronomy queries. Specifically, astronomy workloads do not exhibit the query reuse and query containment upon which semantic (query) caching relies. Also, astronomy queries transfer large data items frequently, which flushes Web caches. Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We have developed the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We have developed an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also have developed on-line algorithms that make no assumptions about workload: a k-competitive deterministic algorithm and a randomized algorithm with minimal space complexity. We have verified the efficacy of bypass-yield caching by running workload traces collected from the Sloan Digital Sky Survey through a prototype implementation.
Bypass-Yield Caching for Large-Scale Scientific Database Workloads in the World-Wide Telescope
PI Randal Burns, Department of Computer Science, Johns Hopkins University
co-PI Ani Thakar, Center for Astrophysical Sciences, Johns Hopkins University
NSF Award IIS-0430848, 10/01/2004-09/30/2007
The World-Wide Telescope (WWT) is a virtual observatory that federates astronomy and astrophysics databases at a global scale, with the ultimate goal of unifying all on-line data and making it available to everyone from everywhere. It dramatically improves the ability to perform multi-spectral and temporal studies by allowing researchers to access many databases with a single query. In the WWT's current form, increasing the number of sites and users leads inevitably to a network crisis. As data-intensive scientific applications increase in scale, bandwidth constrains the performance of all applications sharing a network. As more scientists and educators adopt and rely on the WWT, its increased bandwidth requirements will degrade the performance of every application on those shared networks. Federations need to focus on being good "network citizens," using shared resources conscientiously. If they do not, the workloads generated by these applications will make them unwelcome on public networks.
To avert the network crisis, this project will develop and release an open-source, commodity caching appliance based on two crucial technologies: bypass-yield caching and self-organizing database storage. Bypass-yield caching is an altruistic caching framework for scientific database workloads that balances parallelism in federations against the benefits of caching. It adopts "network citizenship" as its principal goal -- caching in order to minimize network traffic. Database caching introduces an acute storage management problem for which traditional administration is inappropriate. The dynamic creation and destruction of tables in a cache requires automated, incremental storage management with low space overhead. Self-organizing database storage automates storage management and database organization, turning the cache into an administration-free appliance.
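The bypass-or-load trade-off described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a simple rent-or-buy rule in which a query is shipped to the remote database (a bypass, costing only the result size, or yield) until the bytes already spent on bypasses for an object reach the object's size, at which point the object is loaded into the cache. The class name, fields, and sizes are hypothetical, and eviction is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class BypassYieldCache:
    """Toy rent-or-buy cache. All names are hypothetical; there is no eviction."""
    capacity: int                                  # bytes of cache storage
    cached: dict = field(default_factory=dict)     # object name -> size in bytes
    bypassed: dict = field(default_factory=dict)   # object name -> result bytes shipped so far
    wan_bytes: int = 0                             # total wide-area traffic generated

    def query(self, obj: str, obj_size: int, yield_bytes: int) -> str:
        """Handle one query on `obj` whose result (yield) is `yield_bytes` long."""
        if obj in self.cached:
            return "hit"                           # served locally, no WAN traffic
        spent = self.bypassed.get(obj, 0)
        fits = obj_size <= self.capacity - sum(self.cached.values())
        if fits and spent >= obj_size:
            # Earlier bypasses already cost as much as one load: fetch the object.
            self.cached[obj] = obj_size
            self.bypassed.pop(obj, None)
            self.wan_bytes += obj_size
            return "load"
        # Otherwise evaluate at the remote database and ship only the result.
        self.bypassed[obj] = spent + yield_bytes
        self.wan_bytes += yield_bytes
        return "bypass"


cache = BypassYieldCache(capacity=1000)
decisions = [cache.query("photo", obj_size=500, yield_bytes=200) for _ in range(5)]
```

In this sketch an object that is larger than the free cache space is always bypassed, so the selection still runs at the remote database and only the reduced result crosses the network.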
Caching appliances are an enabling technology, making it possible for the WWT to accept a large number of users without impeding the performance of shared networks. Open-source software and commodity hardware make the acquisition of the appliance inexpensive and the self-organization of the cache makes it maintenance-free. The caching appliance will enhance astrophysical and astronomy research, making it possible for scientists to conduct experiments and find correlations across heterogeneous data sets at previously unforeseen rates. The WWT and the caching gateways will also bring telescope research and education to communities of users for which it was previously unavailable, particularly undergraduates and high school students. Project plans include outreach in the form of a pilot program that will install and maintain WWT gateways at high schools, colleges, and science museums and libraries, and assist those institutions in curriculum development.
Reports
- Project Highlight on New Directions. Processing Spatial Joins at Global Scale. Report to NSF. January, 2007.
Publications (from this project)
- X. Wang, R. Burns, and T. Malik. LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases. Conference on Innovative Data Systems Research (CIDR), ACM, 2009.
- T. Malik and R. Burns. Workload-Aware Histograms for Remote Applications. International Conference on Data Warehousing and Knowledge Discovery (DaWaK), 2008.
- X. Wang, T. Malik, D. Dash, R. Burns, and A. Ailamaki. Automated Physical Design in Database Caches. Workshop on Self-Managing Database Systems (SDS), IEEE, 2008.
- X. Wang, R. Burns, and A. Terzis. Throughput-Optimized, Global-Scale Join Processing in Scientific Federations. Workshop on Networking Meets Databases (NetDB), USENIX, 2007.
- X. Wang, R. Burns, A. Terzis, and A. Deshpande. Network-Aware Join Processing in Global-Scale Database Federations. International Conference on Data Engineering (ICDE), IEEE, 2008.
- X. Wang, T. Malik, R. Burns, S. Papadomanolakis, and A. Ailamaki. A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching. Database Systems for Advanced Applications (DASFAA), IEEE, 2007.
- T. Malik, R. Burns, and N. Chawla. An Argument for a Black-Box Approach to Query-Result-Size Estimation. Conference on Innovative Data Systems Research (CIDR), ACM, 2007.
- T. Malik, R. Burns, N. Chawla, and A. Szalay. Estimating Query Result Sizes for Proxy Caching in Scientific Database Federations. Supercomputing (SC), ACM, IEEE, 2006.
- T. Malik, R. Burns, and A. Chaudhary. Bypass Caching: Making Scientific Databases Good Network Citizens. International Conference on Data Engineering (ICDE), 2005.
Personnel
- Tanu Malik, Ph.D. Student
- Xiaodan Wang, Ph.D. Student
- Nolan Li, Ph.D. Student
- Joshua Kirschstein, Masters Student
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


