University of Gothenburg Publications (Göteborgs universitets publikationer)

The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP

Authors and departments:
Stian Rødven Eide (Department of Philosophy, Linguistics and Theory of Science); Nina Tahmasebi (Department of Swedish); Lars Borin (Department of Swedish)
Published in:
Linköping Electronic Conference Proceedings. Digital Humanities 2016. From Digitization to Knowledge 2016: Resources and Methods for Semantic Processing of Digital Works/Texts, July 11, 2016, Krakow, Poland, 126 (002), pp. 8-12
ISBN:
978-91-7685-733-5
ISSN:
1650-3686
E-ISSN:
1650-3740
Publication type:
Conference paper, peer-reviewed
Publication year:
2016
Language:
English
Full-text link:
Abstract:
In this paper we present a dataset of contemporary Swedish containing one billion words. The dataset consists of a wide range of sources, all annotated using a state-of-the-art corpus annotation pipeline, and is intended to be a static and clearly versioned dataset. This will facilitate reproducibility of experiments across institutions and make it easier to compare NLP algorithms on contemporary Swedish. The dataset contains sentences from 1950 to 2015 and has been carefully designed to feature a good mix of genres balanced over each included decade. The sources include literary, journalistic, academic and legal texts, as well as blogs and web forum entries.
Subject (based on the Swedish National Agency for Higher Education's classification of research subjects):
NATURAL SCIENCES ->
Computer and Information Science ->
Language Technology (Computational Linguistics)
Keywords:
A One Billion Word Swedish Reference Dataset for NLP
Record number:
238134
Record created:
2016-06-22 15:05
Record modified:
2016-07-01 15:32