What is TS Corpus?
TS Corpus is a Turkish corpus project. It is a general-purpose, tagged corpus containing about 491 million PosTagged tokens (491,360,398). TS Corpus aims to bring together earlier studies in Turkish computational linguistics and corpus linguistics studies from around the world to form a "usable" and "accessible" Turkish corpus as an end-user product.
TS Corpus is the first PosTagged Turkish corpus ever built. The first version of the corpus was published in March 2012 and the second version in August 2012. Work on version 2.1 is still in progress.
TS Tokenizer & PosTagger
In order to build a PosTagged corpus, each input text must first be converted into WPL (word per line) format. TS Tokenizer is a Turkish tokenizer that produces WPL output from input texts.
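The WPL conversion described above can be illustrated with a minimal Python sketch. This is not the TS Tokenizer itself: the function name `to_wpl` and the regular expression are assumptions made for illustration, and the rule that keeps apostrophe-attached suffixes (as in "Ankara'ya") inside one token is a simplification, not documented TS Tokenizer behaviour.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# apostrophe-attached suffixes, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def to_wpl(text):
    """Return the input text in word-per-line (WPL) form, one token per line."""
    return "\n".join(TOKEN_PATTERN.findall(text))


if __name__ == "__main__":
    print(to_wpl("Ankara'ya gidiyorum."))
```

Each token ends up on its own line, which is the shape a tagger can consume line by line.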
TS Tokenizer can also catch smileys, internet-specific usages and misused punctuation. An online demo of the TS Turkish tokenizer is also available.
The PosTagger takes its input from the tokenizer and produces output usable with CWB or similar tools. This Turkish PosTagger uses a new tag set that includes smileys (emoticons), misspellings (YY), internet slang (intSlang), words with emphasis (intEmphasis) and so on. An online demo is also available.
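As an illustration of what "output usable with CWB" can look like, the sketch below writes tagged tokens in the vertical text format that CWB encodes: one token per line, attributes separated by tabs, and structural regions marked by XML-style tags on their own lines. The function name `to_vertical`, the `s` region name and the sample tag assignments are assumptions; only the tag names intSlang and intEmphasis come from the tag set mentioned above.

```python
def to_vertical(tagged_tokens, struct="s"):
    """Render (word, tag) pairs as CWB-style vertical text.

    Each token becomes one line of the form "word<TAB>tag", wrapped in an
    opening and closing structural tag (hypothetically named "s" here).
    """
    lines = [f"<{struct}>"]
    lines.extend(f"{word}\t{tag}" for word, tag in tagged_tokens)
    lines.append(f"</{struct}>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical tagger output; the tag assignments are illustrative only.
    sample = [("slm", "intSlang"), ("çoook", "intEmphasis")]
    print(to_vertical(sample))
```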
TS Corpus - TweetS
TS Corpus presents a new corpus: TS TweetS. The corpus consists of 1 million tweets (more than 13 million tokens) and offers the full features of the state-of-the-art CWB and CQP. In addition, each tweet is shown individually on the result screen so that tweets are easier to tell apart.
This version is an early try-out of text processing tools that are still under development. An improved version, including more of TS TweetS, will be ready soon.
TS TweetS Corpus is the first Turkish Twitter corpus with POS tags. It allows users to run queries using these tags, restricted by a pre-defined classification of the year in which the tweets were posted.
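Tag-based queries of this kind are written in CQP syntax, where each token is a bracketed set of attribute constraints such as `[pos="YY"]`. The helper below is a hedged sketch for assembling such query strings. The function names and the attribute names `word` and `pos` are assumptions about how the corpus is encoded, and values are assumed not to contain double quotes.

```python
def cqp_token(**attrs):
    """Build one CQP token expression, e.g. [word="x" & pos="YY"].

    With no constraints, returns "[]", which matches any token in CQP.
    """
    if not attrs:
        return "[]"
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(parts) + "]"


def cqp_query(*tokens):
    """Join token expressions into a sequence query terminated by ';'."""
    return " ".join(tokens) + ";"


if __name__ == "__main__":
    # Any token followed by one tagged as a misspelling (YY).
    print(cqp_query(cqp_token(), cqp_token(pos="YY")))
```

The resulting string can be pasted into a CQP-style query box. Restricting by year would need the corpus's own attribute for that classification, which is not named here.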
TS Gezi Corpus
TS Gezi Corpus is a specialized corpus composed of about three thousand news articles from the Turkish and foreign press about the Gezi Park protests. Both the Turkish and the English news texts are POS-tagged.
TS Corpus Wikipedia -Beta- is a PosTagged Turkish corpus composed of Turkish Wikipedia pages. It offers the capabilities of CWB and CQP, and all the features of TS Corpus are also available in TS Corpus Wikipedia -Beta-.
Further information and documentation will be available on tscorpus.com in the coming days. Use the TS Corpus SignUp&Login page to sign up.
TS Corpus Search Engine
In order to crawl "big data" for the next version of TS Corpus (TS Corpus version 3), a new infrastructure has been set up on the TS Corpus server. One part of this framework is a search engine. The TS Corpus search engine uses Apache Nutch, Apache Solr, Apache Lucene and CrawlAnywhere.
The TS Corpus Search Engine now has over 1 million web pages indexed, and this number grows every hour. Click the link below for a test drive.
TS Corpus links on the Internet
Since the release of TS Corpus version 2, several notable websites have linked to TS Corpus. Some of them are:
CWB (Corpus WorkBench) Official Website
Johns Hopkins University (USA)
Michigan University Library Database (USA)
North Carolina University (USA)
Washington University (USA)
George Mason University (USA)
TS Corpus at a Glance
- TS Corpus is the largest "online available" Turkish Corpus ever built
- TS Corpus is the first Turkish corpus whose data is POStagged
- TS Corpus is the first Turkish corpus whose data is morphologically tagged
- TS Corpus is the first Turkish corpus that has tagged lemmas and that you can search by using lemmas
- TS Corpus is the first Turkish corpus that can be accessed online
- TS Corpus is the first Turkish corpus that can present seven different kinds of statistical information to the user
- TS Corpus is the first Turkish corpus that has been built on a CWB-CQP infrastructure
- TS Corpus is the first Turkish corpus that lets users save their results in different file formats
- TS Corpus is the first Turkish corpus that has been made freely available
Dear TS Corpus users. TS Corpus has moved to its "new user interface". The new interface has new features like:
-- Highlighting lines on the result screen
-- Classy display of Query History and Saved Queries screens
-- New Navigation bar with drop-down menus
-- Standard Queries are now triggered by Enter/Return key
The new interface also works on small displays such as smartphones and tablet PCs. The main TS Corpus page now fits on displays as small as 3.2 inches.
Because the interface is built on CSS3, it is compatible with all modern browsers such as Firefox, Opera, Safari, Iceweasel etc.
Hope you like it...
TS Corpus is now on a New Server
Until today, TS Corpus was published on a limited infrastructure. Because of power cuts and ISP problems, TS Corpus reached only 92% uptime.
TS Corpus has now moved to a professional level. The corpus is hosted at OVH, one of the largest hosting providers in Europe. TS Corpus now runs with a load-balanced database structure on a professional data-center server with a 100 Mbps internet connection and reliable, uninterrupted service.
With its new professional infrastructure, TS Corpus can now process your queries 40% faster and serve results 5 times faster.