Lectures

Spreading the Use of Big Data in Scientific Research

About the series

Speakers: Alexander Szalay and Stuart Feldman

Abstract: Big Data is about data-intensive computing, analysis at scale, iteration, and understanding. It takes advantage of supercomputing, massive simulations, and real-time filtering, but it has a distinct engineering goal: maximizing economical access to massive amounts of bits. Scientists talk about “big data,” but most do not yet have large quantities; for many, gigabytes are significant and a few terabytes are aspirational. Deep learning absorbs enormous amounts of specialized cycles. There are huge possibilities for scientific progress using these relatively new approaches: new generations of instruments will produce orders of magnitude more observations, global catalogues will be enormous, and new techniques are waiting to be applied (machine learning, of course, in its many variations, but Bayes rules). Yes, the LHC and LSST are in the petabyte range, but they are still the exception.

In practice, industry is far ahead of the academic research world in building huge storage, because it faced larger quantities sooner, had sharper network needs and requirements for distributed access, and had the engineering resources to solve multiple problems. To many universities, a petabyte is a scary quantity, not a building block. Institutions have slow refresh cycles, naturally rely on vendor-based general solutions, and install storage systems designed for other purposes. Massive cloud systems follow a different design point, and that design point can be applied to more local storage. Universities and their funders need to shorten the interval between industrial demonstration and new uses. Schmidt Futures decided to prime the pump by supporting Prof. Szalay’s plan to build practical large storage nodes as the seed for a massive research storage network. His team engineered solutions based on leading-edge COTS components and their own open-source software. We are on the way: Szalay’s group has designed and tested a stable 1.5 PB box that can saturate 40 Gb/s networks and costs a bit over $100K, and NSF has started to build out the Open Storage Network (OSN).
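As a rough sense of scale, the following back-of-envelope sketch uses only the figures quoted above (1.5 PB capacity, 40 Gb/s network, roughly $100K per node); the decimal unit conventions are an assumption, not a project specification:

# Back-of-envelope check on the storage-node figures quoted above.
# Assumes decimal prefixes (1 PB = 1e15 bytes, 1 PB = 1000 TB).

CAPACITY_PB = 1.5            # node capacity, petabytes
NETWORK_GBPS = 40            # sustained network throughput, gigabits per second
COST_USD = 100_000           # approximate hardware cost per node

capacity_bytes = CAPACITY_PB * 1e15
throughput_bytes_per_s = NETWORK_GBPS * 1e9 / 8   # bits -> bytes

fill_time_days = capacity_bytes / throughput_bytes_per_s / 86_400
cost_per_tb = COST_USD / (CAPACITY_PB * 1000)

print(f"Time to fill or drain one node at line rate: {fill_time_days:.1f} days")
print(f"Raw storage cost: ~${cost_per_tb:.0f} per TB")

Under these assumptions a single node can be filled or drained over the network in about three and a half days and provides raw capacity at roughly $70 per terabyte.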

The goal of the OSN project is to create a robust, industrial-strength national storage substrate that can benefit a large fraction of the NSF research community and offer a common basis for the cyberinfrastructure of NSF MREFC projects. The challenge is more one of social engineering than a technical one: how does one convince the academic community to embrace a more homogeneous, standardized data storage solution? By deploying an inexpensive, standardized data appliance with about 1.5 PB of storage and a streamlined, simple interface, we hope to make scientists more comfortable with petabyte-scale science and to increase the agility of academic research. The OSN project, co-funded by the NSF and the Schmidt Foundation, aims at an at-scale demonstration of the feasibility of the idea. The four NSF Big Data Regional Innovation Hubs represent a large fraction of the US community interested in data-intensive research, and they will test the OSN environment with real science challenges. These appliances, once deployed at close to full scale, could form one of the world’s largest scientific data management facilities and also serve as a gateway to even larger-scale data sets located in the clouds. We expect that many of the current data services will interface with, and possibly migrate to, the OSN. The talk will also discuss the challenges of maintaining large scientific data sets over extended periods.

Bio: Alexander Szalay is a Bloomberg Distinguished Professor at the Johns Hopkins University, with a joint appointment in the Departments of Physics and Astronomy and Computer Science. He is the Director of the Institute for Data Intensive Engineering and Science (IDIES). He is a cosmologist, working on statistical measures of the spatial distribution of galaxies and on galaxy formation. He has been the architect of the archive of the Sloan Digital Sky Survey. He is a Corresponding Member of the Hungarian Academy of Sciences and a Fellow of the American Academy of Arts and Sciences. In 2004 he received an Alexander von Humboldt Award in Physical Sciences, and in 2007 the Microsoft Jim Gray Award. In 2008 he became Doctor Honoris Causa of Eötvös University, Budapest. In 2015 he received the Sidney Fernbach Award of the IEEE for his work on data-intensive computing. He enjoys playing with Big Data.

Bio: Stuart Feldman is Chief Scientist of Schmidt Futures, where he is responsible for the Scientific Knowledge programs, including creating fellowship programs, supporting nascent innovative research projects, and driving new platforms and larger research projects that aim to change the way scientific research is done and the way universities operate. He did his academic work in astrophysics and mathematics, earning his AB at Princeton and his PhD at MIT. Feldman is best known for writing "Make" and other essential tools. He was awarded an honorary Doctor of Mathematics by the University of Waterloo. He is a former President of the ACM (Association for Computing Machinery) and a former member of the board of directors of the AACSB (Association to Advance Collegiate Schools of Business). He received the 2003 ACM Software System Award. He is a Fellow of the IEEE, ACM, and AAAS. He is Board Chair of the Center for Minorities and People with Disabilities in IT, serves on a number of university advisory boards and National Academy panels, and has served on a wide variety of government advisory committees.

To view the webinar, please register at: http://www.tvworldwide.com/events/nsf/180814

 
