Poynter Online
E-Media Tidbits

Amy Gahran
A group weblog by the sharpest minds in online media


Text Mining the Congressional Record
Posted by Amy Gahran 11:07 AM

On Friday I posted about an intriguing research project that "topic mined" 330,000 archived New York Times articles. I wondered whether it would be possible to "tweak this software to comb through the Federal Register, Congressional Record, or the Thomas legislative database to quickly locate all the buried riders and clauses on particular issues, regardless of how cryptically they're phrased... Well, I can dream..."

It looks like at least part of that dream might be starting to come true.

On Aug. 3, Ars Technica reported that a group of political science researchers from several U.S. universities has conducted a similar automated topic-mining effort (PDF academic paper) on three years' worth of the Congressional Record.

According to Ars Technica: "The computer was able to group speeches into topics, even when those speeches did not feature certain usual keywords. ...Once the computer has done its statistical analysis and grouped speeches into topic clusters, researchers then looked at a few speeches from each cluster and assigned a name to it ('education' or 'terrorism,' for instance). Once that was done, interesting questions could be answered: ...How do elected leaders distribute their attention? Under what circumstances do leaders push or follow public attention to an issue? Is debate on most issues incremental or explosive?"
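
For readers who want a feel for the two-step workflow described in that quote (the machine groups, then humans name the groups), here is a minimal Python sketch. It is not the researchers' method: their statistical model is not described here, and this stand-in uses plain word-count vectors with cosine similarity, which means it only groups speeches that share vocabulary. The function names, the stopword list, the 0.2 threshold, and the sample speeches are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for this illustration.
STOPWORDS = {"a", "an", "and", "for", "from", "more", "must",
             "need", "of", "our", "the", "to", "we"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(speeches, threshold=0.2):
    """Greedy single-pass clustering: each speech joins the most similar
    existing cluster if similarity reaches the threshold, else starts a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(speeches):
        vec = Counter(tokenize(text))
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # add this speech's counts to the cluster
            best["members"].append(i)
    return clusters

def top_words(c, n=3):
    """Most frequent words in a cluster, as a hint for the human who names it."""
    return [word for word, _ in c["centroid"].most_common(n)]

# Invented sample speeches, not taken from the Congressional Record.
speeches = [
    "Our schools need more teachers and smaller classes for students.",
    "Funding for teachers and classrooms helps students learn.",
    "We must protect the nation from terrorist attacks and secure the border.",
    "Border security and intelligence stop terrorist attacks.",
]

for c in cluster(speeches):
    print(c["members"], top_words(c))
```

On these four invented speeches, the first two land in one cluster and the last two in another. A person would then glance at the top words and assign names such as "education" or "terrorism," which mirrors the manual labeling step in the quote. Grouping speeches that do not share the usual keywords, as the researchers reportedly did, takes a more sophisticated statistical model than this.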

Why hasn't this been done before? Well -- not surprisingly to those of us who've covered federal politics and legislation -- there's simply too much content to analyze manually. The documents involved are voluminous, convoluted, cryptic, and arcanely and inadequately cross-referenced. The public statements are bloated, slippery, and laden with code phrases that morph faster than street-drug slang.

In short, the discourse of D.C. is largely impenetrable -- probably deliberately so. This poses a significant obstacle to timely public access to relevant information, and thus to journalism and democracy. Effective automated text or topic mining of that discourse could help us help our audiences be more informed and active citizens.

Ars Technica observes: "With most of the world's information stored in written, not numerical, form, this sort of text mining could be one of the most exciting research areas in the next decade. Stay tuned." One place to follow developments in this field is TextMining.org.

(Thanks to Matthew Waite of the St. Petersburg Times for the tip.)
