The Internet Filter Farce

“What if the baseball could repair the window?” reads the headline of a recent ad for myCIO.com. The copy continues: “The Internet caused the problem. It’s only fitting it should also provide the solution.” As it happens, the advertiser is offering remote management of network security. But the slogan would serve just as well for dozens of other electronic products and services that promise to address the manifold anxieties that the Internet gives rise to–anxieties about hackers, threats to privacy, spam, rumors, commercialism, pornography, fraud, lost work time, or simply the difficulty of finding your way around cyberspace. For every article raising the alarm about one or another of these problems, there’s a clutch of software engineers sitting in a loft somewhere trying to turn the concern into a market opportunity.

It’s an understandable response, given the remarkable achievements of the technology and the hype that accompanies every innovation. But it can also lead to misguided or even irresponsible decisions, as people naively trust the technology to right its own wrongs. It’s one of the more dangerous guiding principles of the new economy: The remedy for the abuse of digital technology is more digital technology.

The problem is nowhere more evident than in the frenzy to equip homes, schools, libraries, and workplaces with blocking technology–the programs described as “content filtering software” by their makers and as “censorware” by their critics. They go by suggestive names like CYBERsitter, SafeClick, Cyber Patrol, NetNanny, SurfWatch, and I-Gear. It’s a good business to be in right now. A recent report by the research firm IDC estimates that the content filtering market already exceeds $150 million a year and will reach $1.3 billion by 2003. Parents have been buying the software to protect their children, and search engines and Internet service providers (ISPs) have been offering blocking as a subscriber option. Corporations have been using the software to block employees’ access to pornography (often citing the threat of sexual harassment charges) or, more generally, to restrict access to any non-work-related Web sites. Schools and libraries have been installing the software, sometimes reluctantly, in response to state and local laws that require its use, and federal legislation mandating filters is in the offing.

One reason for the enthusiasm about filters is that they can be seen as a benign alternative to legislative restrictions on speech or access. Such laws generally prove to be unconstitutional–like the Communications Decency Act, which was overturned in 1997. [See Marjorie Heins, “Screening Out Sex: Kids, Computers, and the New Censors,” TAP, July-August 1998.] In fact, when the Third Circuit Court suspended enforcement of the 1998 Child Online Protection Act in February 1999, it cited filters as a less restrictive alternative. Advocates of filters argue that since the software is a commercial product that people adopt voluntarily, questions of censorship can’t arise. As the director of one service provider that uses the software put it, “The First Amendment is not concerned with the capricious acts of individuals but rather with … the danger posed by the enormous power wielded by the federal government.”

Even so, filters have their critics, particularly among civil libertarians and librarians. They argue that requiring filters in schools or libraries can itself be a form of censorship, even if the lists of sites they block are compiled by private companies–a view that was supported in 1998 by a federal district court in Virginia. Filtering advocates have responded that First Amendment concerns don’t override the custodial responsibilities of the school or library. In the words of South Carolina Attorney General Charlie Condon, “A public library can constitutionally filter filth from the eyes of children.” But “filth” is in the eye of the beholder, and all filters go well beyond blocking hard-core porn, both inadvertently and by design.

Of course, First Amendment considerations aren’t relevant when individuals or private organizations use filters. But there are other reasons why filters might make us uneasy. In corporate settings, the use of filters is part of a growing tendency to restrict employee privacy, along with e-mail monitoring and other forms of electronic surveillance. These steps may be legal if workers are appropriately notified in advance, but they’re a poor substitute for more direct evaluations of employees’ productivity. And they can intensify workers’ disaffection and alienation, an effect that’s exacerbated by the heavy-handed condescension with which employers often try to sweeten the pill. Take the Dilbertesque explanation that one Fortune 500 company offered its employees when it began filtering their Web use a while ago: “Imagine yourself surfing the Web and you come across a link that says, ‘Click here for cool stuff.’ You click on the link and are suddenly presented with a site that has less than suitable pictures for a business environment. If this has ever happened to you, we have good news… .”

Parental restrictions on children’s access to information are clearly in a different category. As Al Gore frequently puts it, “Blocking your own child’s access to offensive speech is not censorship–it’s parenting.” The problem is that parents who buy a commercial filtering program have no way of knowing exactly what speech it blocks, and the software companies are doing all they can to keep their customers in ignorance. The lists of sites blocked by most of the filters are kept encrypted, as are the keyword algorithms they use to block additional sites. And when free-speech advocates have hacked the filters and posted lists of the sites they block, the companies have gotten the courts to suppress the postings on the grounds that they violate provisions of the Digital Millennium Copyright Act. (Those provisions were relaxed last October to allow some circumvention of encryption mechanisms for purposes of finding out what sites filters are blocking.) All of this has put the censorware companies in a position that would delight any other business: Not only does government mandate the use of their products, it also enforces their right to conceal from the public what exactly those products do.

The software companies justify their secrecy by citing the need to protect their intellectual property and by arguing that publication of the lists of blocked sites would enable children to bypass the filters and access inappropriate materials. But neither argument is very plausible. Other companies manage to protect their rights to the databases they compile without keeping them secret, and it would be easy enough to make the lists publicly available without making them accessible to every schoolchild. The real danger for the software companies in making the lists public is that people would rapidly see just how inadequate their software is. That’s the real scandal of the filtering controversy: The technology doesn’t–and can’t–work as promised.

Why Filters Fail

Filters come in different forms. Some are implemented “upstream,” at the level of proxy servers that control access for whole schools, libraries, or businesses; others are implemented “downstream” at individual workstations or PCs. But they all accomplish their filtering in pretty much the same way. The software companies start by compiling “control lists” of the addresses of unacceptable sites. Then, since these lists inevitably miss large numbers of offensive sites, they add automatic keyword filters to block additional sites that contain certain words and phrases. Most of them permit customers to specify the categories of sites they want to block–for example, “sex acts,” “perversions,” “hate speech,” and “drug advocacy,” not to mention additional categories like “job search,” “games,” and “dating,” for the benefit of employers. (SmartFilter even adds a category of “worthless sites” that includes things like pages full of cat stories.) And most keep logs of use and make provision for automatic notification of parents or supervisors or system administrators when someone tries to access an excluded site.
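The two-stage design described above can be sketched in a few lines of Python. This is only an illustration of the general architecture, not any vendor's actual code: the host names, the keyword list, and the functions `check` and `request` are all invented for the example, since the real lists are kept secret.

```python
from urllib.parse import urlparse

# Hypothetical data, invented for illustration; real control lists are not public.
CONTROL_LIST = {"badsite.example": "sex acts", "jobs.example": "job search"}
KEYWORDS = {"xxx": "sex acts", "sex": "sex acts"}

def check(url, page_text, blocked_categories):
    """Return (blocked, reason) for a single page request."""
    # Stage 1: look the host up in the hand-compiled control list.
    host = urlparse(url).hostname or ""
    category = CONTROL_LIST.get(host)
    if category in blocked_categories:
        return True, f"control list: {category}"
    # Stage 2: fall back on a crude keyword scan of the URL and page text.
    text = (url + " " + page_text).lower()
    for word, word_category in KEYWORDS.items():
        if word_category in blocked_categories and word in text:
            return True, f"keyword: {word}"
    return False, "allowed"

log = []  # stands in for the usage log that supervisors can review

def request(url, page_text, blocked_categories):
    """Check a request and record the decision in the log."""
    blocked, reason = check(url, page_text, blocked_categories)
    log.append((url, blocked, reason))
    return blocked

if __name__ == "__main__":
    home_settings = {"sex acts"}
    print(check("http://badsite.example/x", "", home_settings))
    print(check("http://health.example/safe", "Safe sex information", home_settings))
```

In this sketch the second request is blocked by the keyword stage even though the page is a health resource, which is the kind of failure taken up in the paragraphs that follow.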

The inadequacies of the systems are implicit in this basic architecture. In compiling their control lists, software makers have a natural interest in drawing the circle very broadly, so as to block sites that might be objectionable to one or another segment of their market, even if they wouldn’t be considered pornographic or offensive by any reasonable standard. Take safe-sex information. SurfWatch has blocked safe-sex information pages at Washington University, the University of Illinois Health Center, and the Allegheny University Hospitals, and Cyber Patrol has blocked the HIV/AIDS information page of the Journal of the American Medical Association and the site of Planned Parenthood. SmartFilter blocks the safe-sex page of the Johns Hopkins Medical School research group on sexually transmitted diseases. The filters have also blocked numerous sites associated with feminism or gay and lesbian rights. Both I-Gear and CYBERsitter have blocked the site of the National Organization for Women (CYBERsitter cites the “lesbian bias” of the group). I-Gear has blocked the Harvard Gay and Lesbian Caucus, BESS has blocked the Gay and Lesbian Prisoner Project, and NetNanny has blocked Internet discussion groups on AIDS and feminism.

There’s more: Many filters block Web privacy sites and sites that facilitate anonymous Web access. And filter makers routinely use their control lists to block sites critical of their products. SafeSurf has blocked the site of the Wisconsin chapter of the American Civil Liberties Union. I-Gear has blocked the site of the Electronic Privacy Information Center, and SafeClick has blocked some of the testimony at hearings on filters held by the congressionally appointed Commission on Online Child Protection. That must be the dream of every corporate publicist–to be able to prevent your customers from reading any negative comments about your products.

When we pass from control lists to keyword filters, we go from the outrageous to the ridiculous. Sites have been blocked simply because they contain the words witch, pussycat, or button. A government physics archive was blocked because its URL (uniform resource locator, or Web site address) began with the letters XXX. Keyword filters have blocked the sites of Congressman Dick Armey and Beaver College in Pennsylvania. What these anecdotes don’t show, though, is just how extensive the overblocking of keywords is. The censorware companies like to claim that their accuracy is extremely high, citing library studies showing that inappropriate blocks constitute a tiny proportion of all Web accesses. For example, Secure Computing, the manufacturer of SmartFilter, claims that a Utah study showed that blocking of miscategorized pages by its program constituted only .0006 percent of all Web access attempts–a figure cited by Arizona Senator John McCain in support of a mandatory-filtering proposal he is sponsoring. But that’s a highly misleading way of measuring overblocking: Even if a filter blocked every single site on the Web that mentioned safe sex or breast cancer, the total number of incorrectly blocked accesses would be tiny relative to the huge number of accesses to sites like Amazon.com and Yahoo! By analogy, imagine a police force that arrests every Arab American in town on an antiterrorism sweep, then claims that its false arrest rate is under 1 percent, since 99 percent of the total population was not detained.
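The difference between the two ways of counting can be made concrete with a short calculation. The traffic figures below are hypothetical, chosen only so that the first rate comes out at the .0006 percent cited above; the function name `overblocking_rates` is likewise an invention for this sketch.

```python
def overblocking_rates(total_accesses, blocked_accesses, wrongly_blocked):
    """Return wrongly blocked accesses as a share of all traffic and of all blocks."""
    share_of_all_traffic = wrongly_blocked / total_accesses
    share_of_blocks = wrongly_blocked / blocked_accesses
    return share_of_all_traffic, share_of_blocks

if __name__ == "__main__":
    # Hypothetical numbers: 10 million accesses, 100 blocks, 60 of them mistaken.
    of_traffic, of_blocks = overblocking_rates(10_000_000, 100, 60)
    print(f"share of all accesses: {of_traffic:.4%}")
    print(f"share of blocked accesses: {of_blocks:.0%}")
```

Under these assumed numbers the same 60 mistakes look negligible against total traffic and yet make up well over half of everything the filter blocked.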

The only appropriate way to evaluate the filters is to ask what proportion of the sites they block as pornographic or offensive are in fact correctly categorized. And by this standard, the filters fare very poorly. In one recent study, 1,000 randomly chosen addresses in the dot-com domain were submitted to the SurfWatch filter. Of the sites it blocked as “sexually explicit,” more than four out of five were misclassified–for example, the sites of an antiques dealer in Wales, a Maryland limo service, and a storage company in California. In another recent study, the free-speech advocate who runs Peacefire.org hacked the Symantec Corporation’s I-Gear filter and published the list of the first 50 blocked URLs in the dot-edu domain. Fully 76 percent of these pages were errors or misclassifications, most of them completely devoid of sexual content of any kind. The program blocked a diagram of a milk pasteurization system with accompanying text entirely in Portuguese and two long sections of Edward Gibbon’s Decline and Fall of the Roman Empire. It also blocked a tract by the seventeenth-century theologian John Owen entitled “Justification of Faith through the Righteousness of Christ” and a page that contained nothing but a passage in Latin from Saint Augustine’s Confessions. (Intriguingly, it is a passage in which the bishop chastises himself for his impure thoughts–but the filter was doubtless triggered simply by the presence of the Latin preposition cum.)
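A minimal sketch shows how little it takes to produce blocks like these. The keyword list and the function `flagged_words` are assumptions made for illustration, not a reconstruction of any real product, and the matching here is deliberately more careful than raw substring search: it compares whole words only.

```python
import re

# Hypothetical keyword list; the real lists are encrypted by the vendors.
KEYWORDS = {"xxx", "cum", "dick", "beaver"}

def flagged_words(text):
    """Return the listed keywords that occur as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & KEYWORDS)

if __name__ == "__main__":
    for sample in ("Congressman Dick Armey",
                   "Beaver College in Pennsylvania",
                   "summa cum laude",
                   "A diagram of a milk pasteurization system"):
        print(sample, "->", flagged_words(sample))
```

Even with whole-word matching, the first three samples are flagged, because the words themselves are innocent in context and a word list cannot see context.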

This overblocking is an inevitable consequence of the keyword approach. The fact is, it’s impossible to single out porn sites reliably simply by the words they use. Go to Disney’s Go.com, turn on the GoGuardian filter, and do a search on sex; you will get no hits at all. Then turn it off and discover what you were missing: not just porn pages, but the text of the Scientific American article “Bonobo Sex and Society,” the pages on sex discrimination of the Australian Equal Opportunity Commission, and the Michigan Sex Offender Registry. It’s true that filters can fare a bit better by looking for combinations of keywords and by doing some statistical analysis of content. But few of them appear to use sophisticated techniques, probably because any effort to reduce the number of false alarms will inevitably reduce the number of genuine porn or hate sites that they block as well.
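The trade-off can be illustrated with a toy scoring scheme in which a page is blocked once its count of keyword hits reaches a threshold. Every page label and hit count below is invented for the example, and `evaluate` is a hypothetical helper rather than a description of how any actual filter scores pages.

```python
# (label, keyword_hits, truly_offensive); every value here is invented.
PAGES = [
    ("explicit site A", 5, True),
    ("explicit site B", 2, True),
    ("explicit site C", 1, True),
    ("primate research article", 3, False),
    ("sex discrimination page", 2, False),
    ("offender registry", 1, False),
    ("antiques dealer", 0, False),
]

def evaluate(threshold):
    """Block pages whose hit count reaches the threshold; return (false_alarms, misses)."""
    false_alarms = sum(1 for _, hits, bad in PAGES if hits >= threshold and not bad)
    misses = sum(1 for _, hits, bad in PAGES if hits < threshold and bad)
    return false_alarms, misses

if __name__ == "__main__":
    for threshold in (1, 2, 3):
        print(threshold, evaluate(threshold))
```

On this made-up data, each step that raises the threshold removes one false alarm and lets one more offensive page through.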

Filtering advocates have argued that blocked Web pages on Saint Augustine or Gibbon are simply regrettable collateral casualties in the war against online porn and racism: Better, after all, to block some inoffensive sites than to allow some offensive ones to get through. As a field director of the profilter American Family Association puts it: “Filters are workable. We’d rather err on the side of caution instead of being too liberal.” And there’s no question that the software companies have deliberately kept their filters overly broad. It isn’t just that they use overinclusive keywords like sex to screen out pages; they have also blocked whole servers or even whole ISPs when any one of their pages is flagged for objectionable content. Cyber Patrol blocked the entire Deja News (now Deja.com) site, which archives thousands of discussion groups on everything from commercial mortgages to archaeology, and all of the 1.4 million pages on the Web-hosting service Tripod.com. And a number of filters block pages containing banner ads that appear to contain links to inappropriate sites. The software makers reason, probably correctly, that their average customer is more likely to be concerned about porn sites that slip through the screen than about the blocking of useful sites–particularly since customers are usually ignorant about how frequently the latter occurs.

Yet even with the most overly restrictive filtering mechanisms, the programs don’t do an adequate job of blocking porn and other offensive materials. You wouldn’t know this from the claims of the software makers. CYBERsitter guarantees that its software blocks “more than 97 percent of all objectionable content” (though it doesn’t define “objectionable”), and SurfWatch claims that it can “shield users from 90-95% of the explicit material on the ‘net.” But neither company says how it came up with these figures, and independent tests suggest that they are wildly exaggerated. A few years ago Consumer Reports tested the four most common filtering programs against a list of sites that its investigators judged clearly unsuitable for young children. SurfWatch blocked 82 percent of the sites, the highest score of the group, and CYBERsitter blocked only 63 percent (both programs performed much better than NetNanny, which blocked none at all). Another study showed that the filter BESS failed to screen out more than 275 of the sites identified as pornographic on Yahoo!–a singularly easy group to block, since they’ve already been located and labeled.

A study at the Annenberg School of Communications suggests that filters are even worse at identifying violent content than they are at catching pornography. That result is not surprising. Porn sites often give themselves away with genre-specific keywords like XXX or cum, and this makes for relatively efficient filtering. But the only way to block a large number of violent sites would be to use very general keywords that inevitably lead to the overblocking of thousands of useful or informative sites in the process. Do a Web search on “torture+domination,” for example, and you will find a number of disturbingly lurid sites; but you will also find a report from the Canadian Centre for Victims of Torture, the summary of a human rights conference at the University of Chicago, and several pages documenting the horrors of the Holocaust.

What’s more, these studies almost certainly underestimate just how leaky the filters are; and the proportion of offensive content that the filters miss will inevitably grow as the Web swells. For one thing, the filters simply can’t keep up with the size of the Web and the vast amount of objectionable material it contains. In a 1999 article in Nature, Steve Lawrence and C. Lee Giles found that 1.5 percent of indexable Web servers contained pornographic material, a proportion that would translate to around 80,000 servers at the present size of the Web. Since a single server can host a number of sites, a highly conservative estimate would be 150,000 to 200,000 sites that contain pornographic material. These sites wink on and off and change addresses frequently: The archiving service Alexa.com estimates that the average Web site has a life of 75 days. To locate and flag all this content, a filtering service would have to do periodic sweeps of the entire publicly accessible Web, which as of late 2000 contained in the neighborhood of 1.5 billion pages. That’s more than anyone could possibly track: Even with the extensive resources that search engines like AltaVista and Inktomi have at their disposal, none of them indexes more than 15 percent of the total, and all of them taken together index less than half of it. And even if you could find all the Web pages, a filtering company would require a full-time staff of more than 2,000 people just to check out the two million new pages that are added every day.
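The staffing figure is easier to appreciate once it is broken down per reviewer. The page and staff counts below come from the paragraph above; the eight-hour working day is an assumption added for this sketch.

```python
NEW_PAGES_PER_DAY = 2_000_000   # figure cited in the text
REVIEWERS = 2_000               # figure cited in the text
WORKDAY_SECONDS = 8 * 60 * 60   # assumption: an eight-hour day with no breaks

pages_per_reviewer = NEW_PAGES_PER_DAY / REVIEWERS
seconds_per_page = WORKDAY_SECONDS / pages_per_reviewer

if __name__ == "__main__":
    print(f"pages per reviewer per day: {pages_per_reviewer:.0f}")
    print(f"seconds available per page: {seconds_per_page:.1f}")
```

Even a staff of 2,000 would leave each reviewer 1,000 pages a day, or under half a minute per page on the assumed schedule, and that covers only newly added pages, not changes to existing ones.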

Figures like these make a mockery of the filter makers’ claims that their control lists can offer anything like comprehensive coverage of the Web. The only possible way to get at most of the objectionable content is through keyword filtering. But however broadly it’s applied, that technique misses the large number of porn sites that don’t contain explicit terms that will tip off the filter. And sites that do want to include sexually explicit text can choose among a number of simple ways to circumvent the screens. They can represent the text as an image rather than as a string of characters, for example–a technique that’s used by many people who don’t want their Web page content to be picked up by the “spiders” that crawl the Web compiling the indexes of the search engines. Or they can encode the text in JavaScript so it will bypass the filters unnoticed–and do so in such a way that will still allow people to seek them out. At this point, it isn’t clear how much the porn site proprietors have been using expedients like these. But if the use of filters becomes widespread enough to cut significantly into the pornographers’ profits, there is no question that sexual-content providers will become as resourceful at foiling the software as they have already been in gaming the Web search engines so that their sites come up in the first batch of hits.

Alternatives to keyword identification are even less effective. In 1999 Exotrope, a company in Elmira, New York, introduced a system called BAIR (for Basic Artificial Intelligence Routine), which it billed as capable of recognizing pornographic images with 99 percent accuracy, thanks to its use of artificial intelligence and “active information matrices.” The launch was held at a Schenectady middle school and was attended by New York Governor George Pataki, who hailed the company as one of the fruits of his administration’s efforts to create new high-tech industry in the Empire State and applauded the product’s usefulness: “You’ll be able to have a computer in any classroom unsupervised, a computer at home where your nine-year-old or 10-year-old disappears hours on end, and we can know and be confident that the information they are accessing is appropriate for someone of their age level.” But parents who install the BAIR system would be well advised to check in on their kids from time to time. When the software was independently tested, it correctly labeled only two-thirds of a set of pornographic images–and mislabeled as pornographic exactly the same proportion of a set of ordinary portrait photos downloaded from AOL personal ads. In the end, BAIR is just a system that can identify flesh tones with less than 70 percent accuracy–about par for the present state of image recognition, and miles short of a system that could reliably tell the difference between stills from Deep Throat and from My Dinner with André.

In a way, though, all of this is beside the point. Even if the filters were capable of achieving the fanciful levels of accuracy their advocates like to claim, it isn’t as if there would be anything like a corresponding reduction in the diffusion of pornography. If for argument’s sake we estimate that there are 150,000 porn sites on the Web, a filter that could screen them out with an accuracy rate of 95 percent would still leave 7,500 available, which is more than enough to satisfy the most assiduous pornophile. Bear in mind that an Internet porn site isn’t like a handgun or a gram of cocaine–or, for that matter, like a brick-and-mortar pornographic bookstore–since a single site can serve an indefinitely large number of users from wherever they log in.
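The arithmetic behind that figure is simple enough to check, and to rerun for other accuracy rates. The 150,000-site estimate and the 95 percent rate are taken from the text; the 99 percent case and the function name `sites_left_unblocked` are additions for illustration.

```python
def sites_left_unblocked(total_sites, accuracy):
    """Number of sites a filter misses at the given accuracy rate."""
    # round() absorbs the tiny floating-point error in 1 - accuracy.
    return round(total_sites * (1 - accuracy))

if __name__ == "__main__":
    print(sites_left_unblocked(150_000, 0.95))
    print(sites_left_unblocked(150_000, 0.99))  # hypothetical, even better filter
```

A filter that was right 99 percent of the time, a rate no product discussed here approaches, would still leave 1,500 sites within reach.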

All of this has a familiar ring to linguists who have been working for years to develop tools that can deal with human language in a naturalistic way. Every few years, a new flurry of hype touts a system that has “cracked” this or that aspect of the problem–automatic translation, for example, or realistic question-answering. But when they’re put to the test, the systems never come remotely close to human language capabilities; that goal is not likely to be achieved for decades. It is much easier to reproduce the competence of a chess grandmaster than to reproduce the behavior of the kibitzers around the table.

Filtering software is basically just another system of the same type–except that the techniques it uses are much more primitive than those used by modern translation systems and the like. And the task of filters is far more demanding. While translating a simple sentence or understanding a straightforward question are tasks that are well within the linguistic capacities of a 10-year-old, distinguishing pornography and hate sites from serious discussions of sex or race requires not just adult linguistic competence but adult judgment. It may be that we know obscene material when we see it, as Justice Potter Stewart said, but it is a daunting matter to teach a machine to make such discriminations–or even the much more blatant distinctions that 10-year-olds delight in grasping.

Language analysis software can be useful, so long as we make allowances for its shortcomings. Automatic translation systems do a wretched job by human standards. But sometimes even a very bad translation can be useful–if you’re simply trying to determine whether a hotel in Paris accepts Visa cards, for example, or whether a scientific paper in Japanese is relevant enough to merit a proper human translation. Software that analyzes language content is fine for making a first pass at sorting incoming corporate e-mail, so long as employees are on hand to clear up its errors. And we can tolerate a fairly low accuracy rate from a natural-language query system like Ask Jeeves; the misunderstood questions and irrelevant responses may be a bother, but they don’t do any real damage.

When it comes to protecting children from offensive content, however, our tolerance for error is much lower. Politicians and administrators may find it convenient to believe in the efficacy of filters so they can reassure parents that the technology allows us to leave children alone in front of computers and, as Governor Pataki put it, “be confident that the information they are accessing is appropriate for someone of their age level.” In the current climate, it’s important to be seen as doing something about the problem of offensive content on the Web. But trusting filters to protect schoolchildren from objectionable content is simply irresponsible: It’s like entrusting airport security to a metal detector that misses 40 percent of the concealed handguns and beeps at a third of the metal hangers in passengers’ suitcases.

A Place for Filters?

Filter usage will get a big boost if Congress adopts Senator McCain’s proposed legislation making filters obligatory in institutions that receive the “e-rate” subsidies established by the Telecommunications Act of 1996–just one of several proposals along these lines. It’s clearly an issue with wide appeal: In a 2000 survey sponsored by the Ford Foundation, 92 percent of respondents favored having schools use filters to block pornography, and 79 percent favored their use to block hate speech.

People are right to be troubled by the vast amount of offensive and harmful content on the Web. But there’s reason to be troubled, too, by most of the approaches proposed thus far to deal with it, whether technological, legal, or political. Legislative restrictions on the distribution of obscene or indecent content have generally proved to be unconstitutional and, in any case, are extremely difficult to enforce. The Web is not a place that lends itself to police sweeps: Sites change their addresses frequently, and the people responsible for them are hard to find and identify. Then, too, a large proportion of offensive and pornographic sites are located at servers outside U.S. jurisdictions; setting up a Web site in Thailand is even easier than opening a bank account in the Bahamas. So it’s unrealistic to expect too much from efforts at supply-side controls.

The other solutions that have been proposed also have their limitations. Self-rating policies for Web sites raise issues of practicality and enforcement [see Joshua Micah Marshall, “Will Free Speech Get Tangled in the Net?” TAP, January-February 1998]. “Whitelist” or “greenspace” domains restricted to prescreened age-appropriate material may be useful for younger children, but they inevitably exclude huge amounts of valuable information.

There are limits, too, to what we can expect from adult supervisors. “Tap on the shoulder” policies put librarians in the dubious position of having to police their patrons’ Web use. What’s more, monitoring policies are likely to inhibit young people from using the Web to find answers to their questions about topics like safe sex, suicide, homosexuality, or other areas of concern that they might have reasons for concealing from parents, teachers, or librarians.

Filters raise even more problems than most of these other solutions do, but in the current political climate, it isn’t likely that they will go away. Still, before we allow anyone to install filters in public institutions, we can ask that the programs be improved and made more selective, and that their makers be held accountable to reasonable standards of public disclosure. For starters, the filter companies should be required to list publicly the names of all identifiable organizations, publications, and sites included on their control lists, along with a brief description of the offending content. This is simply a matter of truth in advertising: If a filter is blocking a large number of gay and lesbian sites or safe-sex sites, people have a right to know this in advance. Indeed, it’s hard to see how the use of filters in libraries could be constitutionally defended if the librarians have no way of knowing what point-of-view biases they might incorporate.

Second, the blocking of sites by filters used in public institutions should be subject to judicial or administrative review. At present, sites can be unblocked

This article appears in Issue: Spurious George.

Geoffrey Nunberg
