Historizing topic models: A distant reading of topic modeling texts within historical studies
Authors and department:
Rene Brauer (Department of Earth Sciences); Mats Fridlund (-)
Cultural Research in the Context of “Digital Humanities”: Proceedings of International Conference 3-5 October 2013, pp. 152-163
Topic modeling (TM) is a data-driven method within the new ‘digital history’ that may come closest to fulfilling literary historian Franco Moretti’s promise of making ‘distant reading’ of large quantities of text possible. Inspired by this promise, TM has been used in historical studies since the early 2000s. This study surveys the state of the art of TM in historical studies by giving a historical and methodological introduction to its use within historically minded research.
TM was first developed for data mining within natural language processing and machine learning in the 1990s, and its overwhelming benefit was its ability to cover orders of magnitude more data than traditional methods. The primary topic model used is Latent Dirichlet Allocation, which allows TM to be used as a search function, a quantitative check of intuition, or a summarization tool for large corpora of texts. With many competing theories and assumptions that are constantly being challenged and developed, TM itself currently represents a very active area of research within computer science.
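To make the summarization use concrete, the following is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, written in plain Python. It is an illustration only and not the procedure used in any of the surveyed studies. The toy corpus, the function names `lda_gibbs` and `top_words`, and the hyperparameter values (`alpha`, `beta`, number of iterations) are all assumptions chosen for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns per-document topic counts and per-topic word counts.
    Hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


def top_words(topic_word, n):
    """List the n most frequent words assigned to each topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


# Hypothetical toy corpus, invented purely for illustration.
docs = [
    "war army battle soldiers war".split(),
    "army battle soldiers general".split(),
    "trade ships port merchants trade".split(),
    "merchants port ships cargo".split(),
]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word, 3)):
    print(f"topic {k}: {', '.join(words)}")
```

The per-topic word lists serve as a rough summary of the corpus, while the per-document topic counts in `doc_topic` can be used to retrieve the documents most associated with a topic, which corresponds to the search use mentioned above. For corpora of realistic size, an established library implementation would be preferable to this sketch.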
The survey of historical texts takes as its starting point the first peer-reviewed historical article in 2006 and as its end point the publication of the first research monograph in 2013, and it identified 23 historical studies employing TM. To provide a general overview of the field, the studies were examined using a quantitative distant-reading approach and analyzed according to the authors’ academic background, gender, academic seniority and country of academic institution, and the corpora’s type, language, chronology and geographical focus. The results showed that most authors were junior, untenured male researchers, primarily affiliated with US universities, and that a substantial number of the texts were non-standard online texts. Despite its application within historical studies, TM still comes across as a technology-driven approach, with the majority of authors having a background in technical disciplines. Corpora were primarily focused on English texts with a US or global focus and with an emphasis on recent history. All in all, TM appears to be an emergent rather than established historical methodology.
Subject (based on the Swedish National Agency for Higher Education's classification of research subjects):
History and Archaeology ->
Languages and Literature ->
Comparative Language Studies and Linguistics ->
Philosophy, Ethics and Religion ->
History of Ideas
topic modeling, digital history, digital humanities, historical methodology, Latent Dirichlet Allocation