On Friday I posted about an intriguing research project that "topic mined" 330,000 archived New York Times articles. I wondered whether it would be possible to "tweak this software to comb through the Federal Register, Congressional Record, or the Thomas legislative database to quickly locate all the buried riders and clauses on particular issues, regardless of how cryptically they're phrased... Well, I can dream..."
It looks like at least part of that dream might be starting to come true.
On Aug. 3, Ars Technica reported that a group of political science researchers from several U.S. universities has conducted a similar automated topic-mining effort (PDF of the academic paper) on three years' worth of the Congressional Record.
According to Ars Technica: "The computer was able to group speeches into topics, even when those speeches did not feature certain usual keywords. ...Once the computer has done its statistical analysis and grouped speeches into topic clusters, researchers then looked at a few speeches from each cluster and assigned a name to it ('education' or 'terrorism,' for instance). Once that was done, interesting questions could be answered: ...How do elected leaders distribute their attention? Under what circumstances do leaders push or follow public attention to an issue? Is debate on most issues incremental or explosive?"
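For readers curious what that grouping step might look like, here is a minimal Python sketch. It is not the researchers' method; their statistical analysis is presumably far more sophisticated. This version swaps in a simple greedy clustering based on word overlap, and the sample speeches, stopword list, similarity threshold, and function names are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical stopword list; a real project would use a much longer one.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "our", "for", "we", "is", "are"}

# Made-up one-line "speeches" standing in for Congressional Record entries.
SPEECHES = [
    "Teachers and students deserve better schools and smaller classrooms.",
    "Airport security and border screening protect against terrorist attacks.",
    "Funding for schools helps teachers reach students.",
    "Terrorist networks threaten border security.",
]


def vectorize(text):
    """Turn a speech into word counts, ignoring punctuation and stopwords."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(speeches, threshold=0.2):
    """Greedy clustering: each speech joins the most similar existing
    cluster, or starts a new one if nothing is similar enough."""
    clusters = []
    for index, text in enumerate(speeches):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(vec, candidate["centroid"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [index]})
        else:
            best["centroid"].update(vec)  # fold the speech into the cluster
            best["members"].append(index)
    return clusters


def top_terms(found_cluster, n=3):
    """Most frequent words in a cluster, as a hint for a human labeler."""
    return [word for word, _ in found_cluster["centroid"].most_common(n)]


if __name__ == "__main__":
    for found in cluster(SPEECHES):
        print(found["members"], top_terms(found))
```

Note that none of the sample speeches contains the word "education," yet the two about schools and teachers end up in the same cluster because they share other vocabulary. A person would then read a few speeches from each cluster and give it a name, as the researchers did.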
Why hasn't this been done before? Well -- not surprisingly to those of us who've covered federal politics and legislation -- there's simply too much content to analyze manually. The documents involved are voluminous, convoluted, cryptic, and arcanely and inadequately cross-referenced. The public statements are bloated, slippery, and laden with code phrases which morph faster than street-drug slang.
In short, the discourse of D.C. is largely impenetrable -- probably deliberately so. This poses a significant obstacle to timely public access to relevant information, and thus to journalism and democracy. Effective automated text or topic mining of that discourse could help us help our audiences be more informed and active citizens.
Ars Technica observes: "With most of the world's information stored in written, not numerical, form, this sort of text mining could be one of the most exciting research areas in the next decade. Stay tuned." One place to follow developments in this field is TextMining.org.
(Thanks to Matthew Waite of the St. Petersburg Times for the tip.)
The project Amy describes is really just an effort to...