Update: Links have been added to some of the library's programs and partners.
On Wednesday, the Library of Congress announced it had signed an agreement with the microblogging service Twitter to archive all public tweets sent since the service began in 2006. I spoke with Martha Anderson, the director of the National Digital Information Infrastructure and Preservation Program at the Library of Congress, about the project and how it fits into the library's digital-archiving efforts. She warned me when we got started that her department had a cumbersome name.
That's a very impressive-sounding title.
Well, the name is horrible. It's shortened to NDIIPP. [Laughs.] We hear all sorts of puns and things about that.
What's the best?
That with digital preservation, we are in deep.
So who came to you with the request, or the idea about Twitter?
Twitter approached us. They were looking around; they are a small business -- which happens, quite often. Businesses cannot afford to sustain all the content they create over the life of the business. And Twitter hadn't reached that point yet, but they were aware of the need to sustain the content in some way.
So they began to look around for a strategy for conserving that content in the long term. They knew we had this program at the library, so they called us and asked if we were interested in the Twitter archive.
We do a collection for every Supreme Court nominee -- Web sites and blogs and all sorts of things. Well, one of the things they asked us to collect was tweets for the nomination of Justice Sotomayor. So that was the first indication we had that our selection officials were interested in Twitter.
Correct me if I have this wrong, but in the past, you've done your Web archiving on a subject basis, and this is the first time you're grabbing an entire type of content off the Web?
Exactly. And that's the significance of this. Yesterday [Wednesday, when the agreement was announced], one of my staff came in to tell me that people were saying this was a change from static to streaming. This is [the] first time [on the web] we're looking at a whole corpus of material from a source.
And I think personally, this is me, don't quote me as saying this from the library, as librarians we need to think more about our relationships to content creators, content-generating activities, in a way we used to think about things with publishers -- we would get a relationship to a publisher through copyright, or that sort of thing. Now, the information base is different, and we really need to work on those kinds of relationships.
Is there anything analogous in Library of Congress history?
Well, the library is accustomed, with analog materials, to collecting everything from a creator -- we have in our Prints and Photographs Division all the output from the Department of the Interior's Historic American Buildings Survey. It's a huge record of American architecture.
A lot of times we will get all the negatives and works of a photographer. So we're used to a mass of things, rather than a selection, in the analog world. This is our first foray into doing this in the digital world.
When do you start?
The agreement has been signed, but we still have a lot of technical details to work out -- how we'll technically transfer it, and when. There's a built-in six-month window, so we don't have the live Twitter archive at any given time. There is a window for people if they want to delete their tweets, things like that.
There's a built-in lag?
Yes, so once the transfer is complete, if a researcher comes here, we'll let them know that it's 2006 till six months prior. And there'll be a rolling period of transfers after that.
Can individuals choose to opt their tweets out of it?
You know, I don't know. I think that's a question for Twitter. There are several questions about that which they are still working out. We asked them to deal with the users; the library doesn't want to mediate that.
What about user information? Have you any thoughts about whether you're going to keep that or strip that out? Obviously, that gives a lot of context for a tweet.
It does. And I think that's one of the big issues for us to understand in terms of privacy. And there's a lot of work going on, especially over at [the National Institutes of Health] about how to anonymize data and still make it useful. We're really big on partnering with people to learn what they're learning, so I think that's an area we'll look into. In serving it, what can we do to make it useful to research but not identify personal information?
Is the plan to keep all tweets, forever?
Nothing is forever! I think this is a real learning opportunity. We're embarking on this with the idea that what we receive, we will keep for the long term. That's about the best we can say.
How much will it cost?
Well, it's a gift; we didn't pay for it. But it will be the cost of storing what is, right now, around 5 terabytes, and the staff effort of maybe one full-time person over the years.
So there could be a Twitter Librarian!
Yes, there could be. But, you know, in general, people at the library don't work on one thing all day long. So it's probably, all told, work [that] adds up to one person's job, but it takes all sorts of expertise that's not usually held in one body.
One complaint with Twitter is that it's difficult to follow a conversation that will thread through various people's replies, or even that a particular event will not always have an agreed-upon hash tag -- take Sotomayor. She could be #sotomayor or #sotomayornomination. Do you have any plans to put order to that, to help researchers?
We have a partnership with Stanford University, a bunch of very bright mathematical grad students who have been helping us understand how to mine even our digital collections here. We hope to put them to work building tools to help people make order out of it.
They've done some really interesting work for us on these digitized reports from the [Works Progress Administration] during the Great Depression. They were personal narratives -- people went out and interviewed people all over the country. It's in English, but it's colloquial sometimes. They've helped us get into it and make sense of it, because full-text searching doesn't always do the trick.
Do you think you'll have to create a dictionary of slang for people to understand our tweets?
I don't know! I was thinking that myself this morning. OK, when you read a Twitter fall, there's a whole lot going on that you have to decipher there. Maybe there's someone out there who would like to do that; use the archive to create a dictionary.
So, this goes back a bit to the privacy concern; you said you're leaving up to Twitter the question of user information?
Yes, or what they give to us.
But what about people using the collection in the future? Let's say I run for office in 20 years, will someone be able to come read all my tweets and find everything salty I've said?
Well, whether they use it against you or not is another matter, but that's what people do now. People use libraries to find out information about candidates, about public figures. But very likely, sure. A legitimate researcher, and it could be someone working for a political candidate.
Libraries don't censor. And how [people] use the information is not something we police. But I'm not sure how soon we'll make the tweets available. There could be an embargo of several years, just to give a gap between the current environment.
Do people need to come into the Library to see the digital archive, or is it accessible online?
Some of our collections are available online. There are some materials in the digital archive that we have to ask permission to make publicly available, and if we are not granted permission we can collect it, but we can't make it available through our Web site. You can see on our Web site that we've collected, but you can't see the actual item unless you come into the library.
Do you have a favorite tweet?
There's one I saw yesterday -- "Regarding Library of Congress plan to archive tweets, if journalism is 1st draft of history, is #Twitter the doodles in the margins? :)" I just thought that was cool.