Large-Scale Open Access for Research and Outreach

Paul Ginsparg

Paul Ginsparg is being honored as a Champion of Change for the vision he has demonstrated and for his commitment to open science.

In 1991, before the WorldWideWeb and before the general population was even aware of the internet, physicists had already begun to share pre-publication versions of their articles via email. As a research staff member at Los Alamos National Laboratory, I was concerned that this private sharing unintentionally gave privileged access to more established researchers. To help rectify the situation, I set up a centralized automated repository and alerting system, making cutting-edge full-text articles accessible to anyone with internet access. For the initial group of high-energy theoretical physicists, the system had the immediate positive effect of democratizing the exchange of information within an entire global research community.

By 1993, the advent of the WorldWideWeb suggested ever broader possibilities, with research communication for all fields ported to the new on-line medium. The arXiv system itself quickly grew from a few hundred submissions per year to many tens of thousands, and moved to the Cornell University Library in 2001. Its growth continued to accelerate, and it now receives close to 100,000 new open access submissions per year (see graphs). With nearly a million open access articles in the repository, and hundreds of millions of full-text downloads per year, it serves as the primary daily information feed for global communities of researchers in physics, mathematics, computer science, and related fields. As a proof of concept, it served as the prototype for many other modern open access systems that disseminate scientific research results.

When the general public arrived on the internet in the mid-1990s, intriguing possibilities emerged for engaging beyond the professional research community. Scientists write articles in order to have them read, hence the more readers the better. Research output, like any public good, becomes more valuable the more it is used.

Twenty years ago, I began discussing with officials at NSF and DOE the possibility of creating open repositories for the articles that result from federally funded research. In 1997, biologists (and, by then, mathematicians and computer scientists as well) interested in how physicists were sharing information on-line set up a series of discussions that ultimately led to NIH's PubMedCentral repository (on whose initial advisory board I served) and, a decade later, to the NIH mandate to deposit. The OSTP policy memorandum expanding this mandate to other large federal funding agencies, making "the results of federally funded research freely available to the public within one year of publication," is thus welcome progress, and perhaps the time delay can eventually be reduced. arXiv.org already plays an extremely valuable role in giving the general public access to federally funded research articles: conventional news sites and blogs link directly to multiple articles, frequently bringing hundreds of thousands of readers to popularly accessible articles.

My own work has remained focused on making these systems so useful to researchers that they participate spontaneously, and so (like YouTube) no mandate has ever been required. Most recently, in collaboration with computer scientists at Cornell, Rutgers, and Princeton, I've been implementing a variety of new "big data" data-mining tools for analyzing usage and information genealogy, using arXiv's unique combination of open access texts and twenty years of longitudinal usage data. Some of the results of this work will go on-line in a new experimental interface later this year, with improved ability to track both long- and short-term trends through the literature, a new recommender system to help users cope with information overload, and new interdisciplinary means of information navigation and discovery. This work should further clarify the benefits, to both the research community and the general public, of having fully open access research text aggregated and treatable as computable objects.

Continued growth in distributed network databases, new interoperability protocols, machine-readable document standards, and relevant ontologies will build on these components to catalyze more rapid scientific progress, and provide integration of educational resources for the general public.

Paul Ginsparg is a Professor of Physics and Information Science at Cornell University.
