Data Debase

Steven Spielberg's 2002 film Minority Report depicted a futuristic dystopia in which a "Department of Pre-Crime" jails people for acts they haven't yet committed. In an apparent case of life imitating art, Spielberg's Shoah Foundation -- a nonprofit the film director established in 1994 to videotape and preserve the testimonies of Holocaust survivors -- has handed over the words of 51,000 Holocaust survivors to aid a team of government-funded researchers in developing a revolutionary technology inspired by elements of the Pentagon's scuttled domestic-surveillance program, Total Information Awareness (TIA).

TIA, which was blocked by Congress, was to have been a mix of high-tech voice-recognition and data-mining programs that would have made up the largest domestic-surveillance system in the United States. Using TIA, intelligence analysts and law-enforcement officials would have been able to trawl through Americans' private records -- including banking transactions, e-mail accounts and travel records, such as plane ticket information -- in order to finger terrorists. Though TIA would have been composed largely of voice-recognition surveillance programs, it was the proposal's reliance on the new science of data mining that had groups from the American Civil Liberties Union to the archconservative Free Congress Foundation up in arms.

Broadly speaking, data mining is an innovation of statistical science that allows analysts to detect patterns of events and relationships in order to discover a "gem," or a hidden fact. The idea is to allow users to accurately forecast future events. When data mining emerged in the late 1990s, MIT's Technology Review hailed it as one of the 10 new technologies that will "change the world." Since then, data mining has revolutionized everything from how companies monitor customers' online purchasing habits to how the federal government practices counterterrorism.

Data-mining advocates within the law-enforcement and intelligence communities claim the science makes retrieval of existing information more convenient, allowing them to identify and track terrorists without costly and time-consuming legwork. However, data-mining programs like TIA -- which would have allowed analysts to sift through private citizens' personal records without a search warrant in order to identify patterns that would suggest terrorist activity -- pose some pretty serious due-process problems. Data mining seeks to classify a person's threat level according to superficial patterns of activity like bank withdrawals and travel history. Supporters say this is useful because terrorists typically lead transient lifestyles and have spotty pasts. But so do homeless people, migrant workers and more than a few journalists. Perhaps the most troubling aspect of data mining is that it offers the government a convenient tool to analyze individuals' political and religious affiliations by investigating library records, magazine subscriptions and group memberships. Of course, there is no indication that this is actually happening, but it is a possibility that civil-liberties advocates worry about.

Though Congress elected to defund TIA -- in a nearly unanimous Sept. 24 vote on defense budget appropriations -- the Bush administration is backing a series of TIA-inspired data-mining programs set for implementation by other government agencies and private companies. The Department of Homeland Security's Transportation Security Administration is planning to employ an airline-security program in 2004 called Computer Assisted Passenger Prescreening System II (CAPPS II), which will use data mining to assign color codes to passengers based on their potential threat levels. A private data-mining company, Seisint Inc., has received funding from the departments of Justice and Homeland Security for a program called Multistate Anti-Terrorism Information Exchange (Matrix), which is set for implementation in five states. And the National Science Foundation (NSF) is funding a data-mining project -- led by university researchers and Spielberg's Shoah Foundation -- that would allow users to index massive audio recordings by specific spoken words or phrases.

And TIA itself has not really gone away. In July, as the Senate seemed poised to ban data-mining programs entirely, the White House's Office of Management and Budget protested the restriction of what it called "a powerful potential tool in the war on terrorism," stating, "The administration urges the Senate to remove the provision that prohibits any research and development for the Terrorism Information Awareness [TIA] program." As a result of White House pressure, the bill ultimately agreed upon by a joint House-Senate appropriations committee made allowances for TIA's voice-recognition programs to continue in a research-and-development phase at the Pentagon, while its data-mining programs were transferred to the National Foreign Intelligence Program (NFIP), an agency with a classified budget jointly managed by the CIA, FBI and National Security Agency (NSA). The bill also glaringly avoided the term "data mining," instead authorizing the NFIP to employ "processing, analysis and collaboration tools for counterterrorism foreign intelligence." And the committee limited the NFIP to tracking "non-citizens" -- meaning, in effect, that foreigners living and working in the United States are still subject to being tracked by TIA.

So TIA's most controversial features live on, though in a limited scope and shrouded in the thick of a bureaucratic wilderness. Case in point: The Pentagon's Counterintelligence Field Activity division, a group that works to protect the Defense Department and its personnel from espionage threats, has recently been charged with conducting a data-mining mission, which, according to the Los Angeles Times' William Arkin, includes "process[ing] massive sets of public records, intercepted communications, credit card accounts, etc., to find 'actionable intelligence.'"

As David Sobel, general counsel of the Electronic Privacy Information Center, told me, the bevy of data-mining surveillance programs popping up in government agencies is a trend that will continue to be difficult to monitor. "At the moment there is no government-wide restriction on what we commonly think of as data mining," Sobel explained. "Similar programs and the development of related technologies are likely to be driven underground, and at the moment I don't think there's a very good sense on what those initiatives might be."

Information has surfaced recently revealing that one of the government's major data-mining initiatives, CAPPS II, was developed largely "underground" with the help of private corporations unaccountable to the public. It is not known exactly when CAPPS II's development was initiated, though the program likely began in December 2001, when, according to The Washington Times, NASA's Aviation Systems Division obtained more than 15 million private passenger records from Northwest Airlines after a secret meeting between officials from the two organizations. According to documents obtained by the Times in August 2002, NASA officials solicited Northwest's records for use as research data in developing what would have amounted to a mind-reading device. NASA proposed in these documents to detect and analyze passengers' brainwaves, heartbeat rates and eye-flicker rates and correlate them with data on their travel routines, criminal background and credit information from "hundreds to thousands of data sources" to "determine who is a threat."

Though NASA's bizarre proposal has yet to come to fruition, the development of a strikingly similar program came to light this September, when Jet Blue Airlines admitted to handing over the personal records of 5 million customers in early 2002 to Torch Concepts, a private data-mining contractor hired by the Defense Department. TSA officials have subsequently admitted that they facilitated the handover of Jet Blue's records to Torch; they told Wired News that the study was for a program to improve security on U.S. Army bases.

However, a look at Torch's test, which bears no mention of the military, suggests something altogether different. According to Torch documents discovered online by travel privacy activist Edward Hasbrouck, the test correlated Jet Blue customers' records with their Social Security numbers, income levels and home ownership statuses to group customers into one of three categories based on their perceived threat level: young, middle-income homeowners; older, upper-income homeowners; and a group of passengers with "anomalous" records.

Torch's method of classification looks like a blueprint for CAPPS II, which would require airline passengers to provide carriers with their home addresses, phone numbers and dates of birth for entry in a government-administered computer system. That information would be correlated with government and commercial data including bank account information and travel records. Finally, passengers would be placed in one of three color-coded categories based on their perceived threat levels. Those deemed "anomalous" in Torch's experiment would have been assigned a yellow code under CAPPS II and subjected to additional security checks; those judged nonthreatening would earn a green code and board smoothly; and those whose names showed up on a watch list would be labeled red and then barred from flying or arrested.
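The three-tier sorting rule described above is simple enough to sketch in a few lines of Python. This is only an illustration of the logic as reported: the field names, the `Passenger` record and the order in which the checks run are assumptions, and nothing here is drawn from actual CAPPS II code.

```python
from dataclasses import dataclass


@dataclass
class Passenger:
    name: str
    on_watch_list: bool  # hypothetical flag: name matched a watch list
    anomalous: bool      # hypothetical flag: records judged "anomalous"


def color_code(passenger: Passenger) -> str:
    """Return the color tier for a passenger under the assumed rule."""
    # Assumption: a watch-list match outranks every other signal.
    if passenger.on_watch_list:
        return "red"      # barred from flying or arrested
    if passenger.anomalous:
        return "yellow"   # subjected to additional security checks
    return "green"        # boards without extra screening
```

Note how much rides on the two input flags: the rule itself is trivial, so any error in deciding who counts as "anomalous" passes straight through to the gate.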

As with TIA, CAPPS II has renewed the debate in Congress over the appropriate place of data mining in a democratic society. Led by the technology's staunchest opponent on Capitol Hill, Sen. Ron Wyden (D-Ore.), lawmakers have taken steps to further restrict its use, making CAPPS II's funding in the Homeland Security appropriations bill contingent on a favorable review by the General Accounting Office, which is preparing a report on the program's ability to differentiate between terrorists and innocent people.

Congress has no jurisdiction, however, to oversee another TIA-like program developed by private Boca Raton, Fla., technology firm Seisint Inc. for use by state governments. Seisint's "Matrix" program is essentially a scaled-down version of TIA that uses data mining to establish links between people and patterns of events. The Washington Post reported that the system would be able to "find the name and address of every brown-haired owner of a red Ford pickup truck in a 20-mile radius of a suspicious event." But precisely because of the program's near-omniscient power, even Florida's special agent in charge of statewide intelligence is worried. As he told the Post, "It's scary. It could be abused. I mean, I can call up everything about you, your pictures and pictures of your neighbors."

Matrix has been in use in Florida for more than a year, and in August the Department of Homeland Security announced plans to incorporate nearly a dozen states into the program. Matrix is the creation of former alleged drug smuggler Hank Asher, who, according to the Post, boasted to Florida police officials in the wake of the September 11 attacks that he could develop a system to find the hijackers and any terrorists who might strike in the future. Since Asher initiated the program, Seisint, of which he is CEO, has received a $4 million grant from the Department of Justice and an $8 million pledge from the Department of Homeland Security, a signal of the federal government's determination to spread the program nationwide. Indeed, as reported by the Post, in a Jan. 26, 2003, speech to the Florida Sheriffs Winter Conference, state Commissioner of Law Enforcement James Moore called Matrix the "first step in developing a national intelligence network."

If deployed throughout the country, a program such as Matrix would give local police officers the same power to snoop through individuals' personal records and analyze data that Pentagon anti-terrorism experts would have enjoyed with TIA. According to Matrix's Web site, this would mean prying into an individual's criminal history, driver's license data, vehicle-registration records and incarceration records, including digitized photographs, "with significant amounts of public data record entries." Because the site does not specify which type of "public data record entries" are to be searched, there is no assurance against the investigation of magazine subscriptions, library records and group affiliations -- political, religious or otherwise.

Meanwhile, the National Science Foundation and the Shoah Foundation are developing a technology that would give users the power to search through large recordings of speech in any of 32 languages to instantly find a given word or phrase. To develop this technology, called Multilingual Access to Large Spoken Archives (MALACH), the NSF earmarked $7.5 million in 2001 to Shoah and a team of university computer scientists. Today, MALACH is one of the most ambitious and highly funded programs in the Networking and Information Technology Research and Development Program -- a group with a $5 billion annual budget that NSF administers, along with agencies like the Defense Advanced Research Projects Agency (DARPA) and the NSA -- to create new indexing, information-retrieval and data-mining technology. (NSF's MALACH administrator, Saul Greenspan, declined to be interviewed for this article.)
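To see why such a tool makes searching speech nearly as easy as searching text, consider a rough Python sketch of a spoken-word index. It assumes the hard part, turning audio into timestamped words, has already been done by a speech recognizer; the function names, data layout and tape identifiers are all hypothetical and are not taken from MALACH itself.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each lowercased word to the (recording_id, position) pairs where it occurs.

    transcripts: {recording_id: [(seconds, word), ...]} in spoken order.
    """
    index = defaultdict(list)
    for rec_id, words in transcripts.items():
        for pos, (_seconds, word) in enumerate(words):
            index[word.lower()].append((rec_id, pos))
    return index


def find_phrase(transcripts, index, phrase):
    """Return (recording_id, start_seconds) for every occurrence of the phrase."""
    terms = phrase.lower().split()
    hits = []
    if not terms:
        return hits
    # Look up the first word, then check that the following words match.
    for rec_id, pos in index.get(terms[0], []):
        window = transcripts[rec_id][pos:pos + len(terms)]
        if [word.lower() for _seconds, word in window] == terms:
            hits.append((rec_id, window[0][0]))
    return hits


# Hypothetical sample data: two short recordings with word timestamps.
transcripts = {
    "tape-001": [(0.0, "we"), (0.4, "left"), (0.9, "the"), (1.1, "city")],
    "tape-002": [(12.5, "The"), (12.8, "city"), (13.2, "was"), (13.5, "empty")],
}
index = build_index(transcripts)
print(find_phrase(transcripts, index, "the city"))
```

Once the index exists, each lookup goes straight to the places where the first word was spoken rather than replaying the tapes, which is the convenience at issue: the search never has to listen to anything.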

According to press releases from the Shoah Foundation and the University of Maryland Institute for Advanced Computational Studies (UMIACS), which is spearheading MALACH's research, the impetus for developing MALACH came from Shoah, which was struggling to index 116,000 hours of digitally archived videotape testimony from more than 51,000 Holocaust survivors (many of whom are still alive). So far, at its Los Angeles museum, Shoah has managed to index only 4,000 individual testimonies at a whopping cost of $8 million. An audio search engine like MALACH would make the indexing process a walk in the park, which is why Shoah turned over its entire archive of Holocaust testimonies to MALACH's research team for use as a data set to test the technology.

However, with such a large grant from the NSF, it would be naive to assume that MALACH is an altruistic gift to Shoah. Indeed, a UMIACS press release acknowledges that "this technology will produce significant impact, both through improved access to our cultural heritage and through the application of the techniques that we will develop to other important problems." Whether those "other important problems" include terrorism does not concern Sam Gustman, Shoah's technology director and a former information-retrieval specialist from the Army Corps of Engineers. "If the results of the National Science Foundation project are used by other projects, it's public," Gustman told me. "Our goal is to help the [NSF] further the state-of-the-art technology. Now, if someone uses that [technology] for something else, well, that's the effect of working on something in the public domain."

One of MALACH's lead researchers is Douglas Oard of the University of Maryland, an information-retrieval and automated-translation specialist who is also an expert in high-tech counterterrorism applications. His expertise earned him a grant from DARPA's Information Awareness Office to develop an Arabic and Chinese automatic translation program called TIDES (Translingual Information Detection, Extraction and Summarization) which was to be a major component of TIA. Oard also participated in a recent DARPA project to develop tools to translate Hindi into English. And he has taught a seminar at the University of Maryland called "Information Technology and the War on Terror."

Asked about MALACH's origins, Oard says the program grew out of his earlier work for DARPA developing TIA. As he told me, MALACH "takes some of the research from TIDES, the DARPA program, and it applies it to a National Science Foundation program." Oard would not address MALACH's potential application to the war on terrorism directly, but he did make clear that it could have a broad societal impact. "If you could [implement MALACH], it would change a lot of things -- and we're making very good progress -- about how we do things in our society," Oard said. "Recording conversations is not hard to do, but using the recorded conversation is extremely difficult. So we're trying to make tools that will help you with that. As soon as we do that, we will have changed a fundamental assumption in our society that speech is ephemeral. This could very well revolutionize the way in which our society treats speech."

Just as data mining gives analysts the power to dig through anything from someone's travel records to their Department of Motor Vehicles files for a valuable piece of information, MALACH would allow them to do the same with recorded speech. Considering that thousands of hours of speech are recorded each week in dozens of languages through surveillance satellites and wiretaps by agencies like the NSA, the speed and convenience that MALACH would afford investigators in searching for specific words and phrases is likely to lower the threshold on government snooping.

Oard is surprisingly frank on the question of whether the technology he is developing provides cause for concern. "It's not worth worrying about as a developer," he said. "As a member of society it's very worth worrying about."

Max Blumenthal is a writer living in Los Angeles.