E-book, English, 176 pages
Berman Repurposing Legacy Data
1st edition 2015
ISBN: 978-0-12-802915-2
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: Adobe DRM
Innovative Case Studies
Series: Computer Science Reviews and Trends
Jules Berman holds two Bachelor of Science degrees from MIT (in Mathematics and in Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute (Temple University) and at the American Health Foundation in Valhalla, New York. He completed his postdoctoral studies at the US National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and is the 2011 recipient of the Association's Lifetime Achievement Award. He is a listed author of more than 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and pathology. Dr. Berman is currently a freelance writer.
Authors/Editors
Further information and material
Front Cover
Repurposing Legacy Data
Copyright Page
Contents
Author Biography
1 Introduction
1.1 Why Bother?
1.2 What Is Data Repurposing?
1.3 Data Worth Preserving
1.4 Basic Data Repurposing Tools
1.4.1 A Simple Text Editor
1.4.2 Simple Programming Skills
1.4.3 Data Visualization Utilities
1.5 Personal Attributes of Data Repurposers
1.5.1 Data Organization Methods
1.5.2 Ability to Develop a Clear Understanding of the Goals of a Project
References
2 Learning from the Masters
2.1 New Physics from Old Data
2.2 Repurposing the Physical and Abstract Property of Uniqueness
2.3 Repurposing a 2,000-Year-Old Classification
2.4 Decoding the Past
2.5 What Makes Data Useful for Repurposing Projects?
References
3 Dealing with Text
3.1 Thus It Is Written
3.2 Search and Retrieval
3.3 Indexing Text
3.4 Coding Text
References
4 New Life for Old Data
4.1 New Algorithms
4.2 Taking Closer Looks
4.3 Crossing Data Domains
References
5 The Purpose of Data Analysis Is to Enable Data Reanalysis
5.1 Every Initial Data Analysis on Complex Datasets Is Flawed
5.2 Unrepeatability of Complex Analyses
5.3 Obligation to Verify and Validate
5.4 Asking What the Data Really Means
References
6 Dark Legacy: Making Sense of Someone Else’s Data
6.1 Excavating Treasures from Lost and Abandoned Data Mines
6.2 Nonstandard Standards
6.3 Specifications, Not Standards
6.4 Classifications and Ontologies
6.5 Identity and Uniqueness
6.6 When to Terminate (or Reconsider) a Data Repurposing Project
References
7 Social and Economic Issues
7.1 Data Sharing and Reproducible Research
7.2 Acquiring and Storing Data
7.3 Keeping Your Data Forever
7.4 Data Immutability
7.5 Privacy and Confidentiality
7.6 The Economics of Data Repurposing
References
Appendix A: Index of Case Studies
Appendix B: Glossary
References
Learning from the Masters
Data repurposing has made it possible for scientists to lay the foundations of quantum mechanics, evolution, and modern astronomy. It has enabled us to understand past lives (e.g., Mesoamerican history and culture) and has given us a way to identify every organism on earth (e.g., biometrics). This chapter explains the pivotal role played by data repurposing in these important intellectual achievements.
Keywords
CODIS; identifiers; biometrics; fingerprints; quantum physics; heliocentric system; Mayan glyphs
2.1 New Physics from Old Data
All science is description and not explanation.
Karl Pearson, The Grammar of Science, Preface to 2nd edition, 1899
Case Study 2.1
Sky Charts
For most of us, the positions of the planets and of the stars do not provide us with any useful information. This was not always so. For a large part of the history of mankind, individuals determined their locations, the date, and the time, from careful observations of the night sky. On a cloudless night, a competent navigator, on the sea or in the air, could plot a true course.
Repurposed data from old star charts was used to settle and unsettle one of our greatest mysteries: earth’s place in the universe. Seven key scientists, working in succession over a period of more than two millennia, used night sky data to reach profound and shocking conclusions: Aristarchus of Samos (circa 310–230 BCE), Nicolaus Copernicus (1473–1543), Tycho Brahe (1546–1601), Johannes Kepler (1571–1630), Galileo Galilei (1564–1642), Isaac Newton (1643–1727), and Albert Einstein (1879–1955).
Back in the third century BCE, Aristarchus of Samos studied the night sky and reasoned that the earth and planets orbited the sun. In addition, Aristarchus correctly assigned the relative positions of the known planets to their heliocentric orbits. About 1,800 years later, Copernicus reanalyzed Aristarchus’ assertions to confirm the heliocentric orbits of the planets, and plotted their trajectories. Soon thereafter, Tycho Brahe produced improved star charts, bequeathing this data to his student, Johannes Kepler. Kepler used the charts to derive three general laws describing the movement of planets, including their elliptical orbits. In 1687, Newton published his Principia, wherein Kepler’s empiric laws, based on observational data, were redeveloped from physical principles, Newton’s laws of motion. Newton’s contribution was a remarkable example of data modeling, wherein an equation was created to describe a set of data pertaining to physical objects (see Glossary item, Modeling).
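As a rough illustration of what it means for an empiric law to describe a set of observations, the sketch below checks Kepler's third law (the square of a planet's orbital period is proportional to the cube of its orbit's semi-major axis). The planetary values are modern rounded figures, assumed here for illustration; they are not Tycho Brahe's measurements, and the names `PLANETS` and `kepler_ratio` are hypothetical.

```python
# Modern rounded values (an assumption for illustration, not Tycho Brahe's data):
# semi-major axis in astronomical units, orbital period in Earth years.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def kepler_ratio(semi_major_axis_au, period_years):
    """Return T^2 / a^3, which Kepler's third law predicts is the same for every planet."""
    return period_years ** 2 / semi_major_axis_au ** 3


ratios = {name: kepler_ratio(a, t) for name, (a, t) in PLANETS.items()}

for name, ratio in ratios.items():
    print(f"{name:8s} T^2 / a^3 = {ratio:.3f}")
```

In these units every ratio comes out very close to 1, so a single short equation stands in for the whole table of observations.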
As is almost always the case, this multigenerational repurposing project led to a conceptual simplification of the original data. After the switch was made from a geocentric to a heliocentric system, operating under a simple set of equations, it became far easier to calculate the relative motion of objects (e.g., planetary orbits) and to predict the position of celestial bodies.
From Newton’s work, based on Kepler’s elliptical orbits, based in turn on Tycho Brahe’s data, came the calculus and Newton’s theory of relativity. Newton, as well as his predecessor Galileo, assumed the existence of an absolute space, within which the laws of motion hold true. The planets, and all physical bodies, were thought to move relative to one another in their own frames of reference, within an absolute space, all sharing an absolute time. Einstein revisited Newton’s theories of relativity and concluded that time, like motion, is relative and not absolute.
The discovery of heliocentric planetary motion and the broader issues of relative frames of observation in space were developed over more than 2,000 years of observation, analysis, and reanalysis of old data. Each successive scientist used a prior set of observations to answer a new question. In so doing, star data, originally intended for navigational purposes, was repurposed to produce a new model of our universe.
Case Study 2.2
From Hydrogen Spectrum Data to Quantum Mechanics
In about 1880, Vogel and Huggins published the emission frequencies of hydrogen (i.e., the hydrogen spectroscopic emission lines) [1,2]. In 1885, Johann Balmer, studying the emission frequencies of the hydrogen spectral lines, developed a formula that precisely expressed frequency in terms of the numeric order of its emission line (i.e., n = 1, 2, 3, 4, and so on). Balmer’s attempt at data modeling produced one of the strangest equations in the history of science. There was simply no precedent for expressing the frequency of an electromagnetic wave in terms of its spectral emission rank. The formula was introduced to the world without the benefit of any theoretical explanation. Balmer himself indicated that he was just playing around with numbers. Nonetheless, he had hit upon a formula that precisely described multiple emission lines, in terms of ascending integers.
Twenty-eight years later, Niels Bohr, in 1913, chanced upon Balmer’s formula and used it to explain spectral lines in terms of energy emissions resulting from transitions between discrete electron orbits. Balmer’s amateurish venture into data repurposing led, somewhat inadvertently, to the birth of modern quantum physics.
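The text does not reproduce Balmer's formula, so the following is only a sketch using its commonly cited form, which gives wavelength rather than frequency: wavelength = B × m² / (m² − 4), where m = n + 2 for the n-th visible line. The constant B of roughly 364.56 nm (Balmer's published 3645.6 Å) and the function names are assumptions made for illustration. Frequency follows from dividing the speed of light by the wavelength.

```python
# Assumed value of Balmer's empirical constant, in nanometers (about 3645.6 angstroms).
BALMER_CONSTANT_NM = 364.56

# Speed of light expressed in nm * THz, so that frequency_THz = C / wavelength_nm.
SPEED_OF_LIGHT_NM_THZ = 299_792.458


def balmer_wavelength_nm(line_rank):
    """Wavelength of the n-th visible hydrogen line (n = 1, 2, 3, ...)."""
    if line_rank < 1:
        raise ValueError("line_rank must be a positive integer")
    m = line_rank + 2  # Balmer's integer runs 3, 4, 5, ... for lines 1, 2, 3, ...
    return BALMER_CONSTANT_NM * m ** 2 / (m ** 2 - 4)


def balmer_frequency_thz(line_rank):
    """Frequency of the n-th visible hydrogen line, in terahertz."""
    return SPEED_OF_LIGHT_NM_THZ / balmer_wavelength_nm(line_rank)


wavelengths = [balmer_wavelength_nm(n) for n in range(1, 5)]

for n, wavelength in enumerate(wavelengths, start=1):
    print(f"line {n}: {wavelength:.2f} nm, {balmer_frequency_thz(n):.1f} THz")
```

The only input is an ascending integer, yet the four computed wavelengths fall close to the measured visible hydrogen lines (near 656, 486, 434, and 410 nm).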
2.2 Repurposing the Physical and Abstract Property of Uniqueness
L’art c’est moi, la science c’est nous.
Claude Bernard
An object is unique if it can be distinguished from every other object. The quality of object uniqueness permits data scientists to associate nonunique data values with unique data objects; hence, identifying the data. As an example, let us examine the utility of natural uniqueness for the forensic scientist.
Case Study 2.3
Fingerprints: from Personal Identifier to Data-Driven Forensics
Fingerprints have been used, since antiquity, as a method for establishing the identity of individuals. Fingerprints were pressed onto clay tablets, seals, and even pottery left by ancient civilizations, including the Minoan, Greek, Japanese, and Chinese. As early as the second millennium BCE, fingerprints were used as a type of signature in Babylon, and ancient Babylonian policemen recorded the fingerprints of criminals, much as modern policemen do today (Figure 2.1).
Figure 2.1 U.S. Federal Bureau of Investigation Fingerprint Division, World War II. FBI, public domain (see Glossary item, Public domain).
Towards the close of the nineteenth century, Francis Galton repurposed fingerprint data to pursue his own particular interests. Galton was primarily interested in the heritability and racial characteristics of fingerprints, a field of study that can best be described as a scientific dead end. Nonetheless, in pursuit of his interests, he devised a way of classifying fingerprints by patterns (e.g., plain arch, tented arch, simple loop, central pocket loop, double loop, lateral pocket loop, and plain whorl). This classification launched the new science of fingerprint identification, an area of research that has been actively pursued and improved over the past 120 years (see Glossary item, Classification).
In addition to Galton’s use of classification methods, two closely related technological enhancements vastly increased the importance of fingerprints. The first was the simple procedure of recording sets of fingerprints, on paper, with indelible ink. With the fingerprint card, the quality of fingerprints improved, and the process of sharing and comparing recorded fingerprints became more practical. The second enhancement was the decision to collect fingerprint cards in permanent population databases. Fingerprint databases enabled forensic scientists to match fingerprints found at the scene of a crime with fingerprints stored in the database. The task of fingerprint matching was greatly simplified by confining comparisons to prints that shared the same class-based profiles, as described by Galton.
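The benefit of confining comparisons to a shared class can be sketched in a few lines. This is a minimal illustration, not a description of any real fingerprint system: the card records, identifiers, and function names are hypothetical, and the class labels are borrowed from the pattern names listed above.

```python
from collections import defaultdict

# Hypothetical fingerprint cards; a real card would carry far more detail.
cards = [
    {"card_id": "A-001", "pattern_class": "plain arch"},
    {"card_id": "A-002", "pattern_class": "simple loop"},
    {"card_id": "A-003", "pattern_class": "plain whorl"},
    {"card_id": "A-004", "pattern_class": "simple loop"},
]


def build_class_index(card_list):
    """Group cards by pattern class so a search can skip every other class."""
    index = defaultdict(list)
    for card in card_list:
        index[card["pattern_class"]].append(card)
    return index


def candidates_for(index, pattern_class):
    """Return only the cards that share the pattern class of the print in question."""
    return index.get(pattern_class, [])


index = build_class_index(cards)
candidates = candidates_for(index, "simple loop")
candidate_ids = [card["card_id"] for card in candidates]
print(candidate_ids)
```

A detailed, expensive comparison then needs to run only against the cards in `candidates`, rather than against every card in the database.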
Repurposing efforts have expanded the use of fingerprints to include authentication (i.e., proving you are who you claim to be), keying (e.g., opening locked devices based on an authenticated fingerprint or some other identifying biometric), tracking (e.g., establishing the path and whereabouts of an individual by following a trail of fingerprints or other identifiers), and body part identification (i.e., identifying the remains of individuals recovered from mass graves or from the sites of catastrophic events). In the past decade, flaws in the vaunted process of fingerprint identification have been documented, and the improvement of the science of identification is an active area of investigation [3].
Today, most of what we think of as the forensic sciences is based on object identification (e.g., biometrics, pollen identification, trace chemical investigation, tire mark investigation, and so on). When a data object is uniquely identified, additional data associated with it can be collected, aggregated, and retrieved, as needed.
2.3 Repurposing a 2,000-Year-Old Classification
Our similarities are different.
Yogi Berra
Classifications drive down the complexity of knowledge domains and lay bare the relationships among different objects. Observations that hold for a data object may also hold for the other objects of the same class and for their class descendants (see Glossary item, Class). The data analyst can...





