What is TS Corpus?

TS Corpus is a Turkish corpus project. It is a general-purpose, tagged corpus containing 491 million POSTagged tokens (491,360,398 in total). TS Corpus aims to combine earlier Turkish computational linguistics studies with other corpus linguistics studies from around the world, and to form a "usable" and "accessible" Turkish corpus as an end-user product.

TS Corpus is the first PosTagged Turkish corpus ever built. The first version of the corpus was published in March 2012 and the second version in August 2012. Work on version 2.1 is still in progress.

TS Corpus - TweetS

TS Corpus presents a new corpus: TS TweetS. The corpus consists of 1 million tweets (13+ million tokens). It offers the full feature set of the state-of-the-art CWB and CQP. In addition, each tweet is shown individually on the result screen so that tweets are easier to tell apart.

This version is an early try-out of text processing tools that are still under development. A more developed version, including more of TS TweetS, will be ready soon.

TS TweetS Corpus is the first Turkish Twitter corpus with POSTags. It lets users run queries using these tags within a predefined classification by the year in which the tweets were posted.

Click here for TS TweetS

TS Tokenizer & PosTagger

To build a PosTagged corpus, each input text must first be converted into WPL (word per line) format. TS Tokenizer is a Turkish tokenizer that produces WPL output from input texts.

TS Tokenizer can also catch smileys, internet-specific usages and misused punctuation. An online demo of the TS Turkish tokenizer is available at:

TS Tokenizer
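
As a rough illustration of what WPL output looks like, here is a minimal Python sketch. It is not the TS Tokenizer itself: the regular expressions, the function name tokenize_wpl and the small emoticon set are assumptions made for this example, and the real tool certainly handles many more cases.

```python
import re

# Hypothetical, simplified rules; the actual TS Tokenizer rules are not shown here.
_PARTS = [
    r"[:;=][-']?[)(DPp]+(?!\w)",  # a few emoticons such as :) ;-) :D :((
    r"\w+(?:['’]\w+)*",           # words, keeping apostrophe suffixes (Ankara'da)
    r"([^\w\s])\1+",              # a run of one repeated punctuation mark (!!!, ...)
    r"[^\w\s]",                   # any other single punctuation mark
]
TOKEN_RE = re.compile("|".join(_PARTS))


def tokenize_wpl(text: str) -> str:
    """Return the tokens of `text` in WPL format: one token per line."""
    tokens = [match.group(0) for match in TOKEN_RE.finditer(text)]
    return "\n".join(tokens)


if __name__ == "__main__":
    print(tokenize_wpl("Bugün hava güzel :) Ankara'da!!!"))
```

The emoticon rule is tried first so that ":)" stays one token instead of being split into two punctuation marks, and a run such as "!!!" is kept together so a later step can still recognise it as a single unit.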

The PosTagger takes the tokenizer's output as its input and produces output usable with CWB or similar tools. This Turkish PosTagger uses a new tag set that includes smileys (emoticons), misspellings (YY), internet slang (intSlang), words with emphasis (intEmphasis), etc. The online demo is available at:

PosTagger
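
To show what "output usable with CWB" can look like, the sketch below writes already-tagged tokens in the vertical text format that CWB reads: one token per line, with the tag in a second tab-separated column and each sentence wrapped in its own element. The function name to_vertical, the choice of an s element and the sample words are assumptions for illustration; only the tags YY and intSlang come from the tag set described above, and the real PosTagger output may be laid out differently.

```python
def to_vertical(sentences):
    """Format tagged sentences as CWB-style vertical text.

    `sentences` is a list of sentences; each sentence is a list of
    (token, tag) pairs. This layout is an assumption for the example.
    """
    lines = []
    for sentence in sentences:
        lines.append("<s>")
        for token, tag in sentence:
            # Tabs and newlines would break the column layout, so reject them.
            if any(ch in token or ch in tag for ch in "\t\n"):
                raise ValueError(f"tab or newline in token/tag: {token!r}/{tag!r}")
            lines.append(f"{token}\t{tag}")
        lines.append("</s>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented sample: "slm" as internet slang, "gelcem" as a misspelling.
    print(to_vertical([[("slm", "intSlang"), ("gelcem", "YY")]]), end="")
```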

TS Corpus Wikipedia -Beta-

TS Corpus Wikipedia -Beta- is a PosTagged Turkish corpus composed of Turkish Wikipedia pages. TS Corpus Wikipedia -Beta- offers the capabilities of CWB and CQP. All the features that TS Corpus offers are also offered by TS Corpus Wikipedia -Beta-.

Further information and documentation will be available on tscorpus.com in the coming days. Visit the TS Corpus SignUp&Login page to sign up.

Click here for TS Corpus Wikipedia -Beta-

TS Corpus Search Engine

To crawl "big data" for the next version of TS Corpus (TS Corpus version 3), a new infrastructure has been set up on the TS Corpus server. One part of this framework is a search engine. The TS Corpus search engine uses Apache Nutch, Apache Solr, Apache Lucene and CrawlAnywhere.

TS Corpus Search Engine now has over 1 million web pages indexed, and this number is increasing every hour. Click the link below for a test drive.

TS Corpus Search Engine


 

TS Corpus at a Glance

  • TS Corpus is the largest Turkish corpus ever built
  • TS Corpus is the first Turkish corpus whose data is POStagged
  • TS Corpus is the first Turkish corpus whose data is morphologically tagged
  • TS Corpus is the first Turkish corpus that has tagged lemmas and that you can search by lemma
  • TS Corpus is the first Turkish corpus that can be accessed online
  • TS Corpus is the first Turkish corpus that can present 7 different kinds of statistical information to the user
  • TS Corpus is the first Turkish corpus built on a CWB-CQP infrastructure
  • TS Corpus is the first Turkish corpus that lets users save results in different file formats
  • TS Corpus is the first Turkish corpus that has been made freely available

TS Corpus links on Internet

Since the release of TS Corpus version 2, several notable websites have linked to TS Corpus. Some of them are:

CWB (Corpus WorkBench) Official Website
Linguist List
Michigan University Library Database (USA)
North Carolina University (USA)
Washington University (USA)
George Mason University (USA)
Alphabit.net (Spain)
Wikipedia


TS Corpus on Google

 

New User Interface of TS Corpus

Dear TS Corpus users, TS Corpus has moved to its "new user interface". The new interface has new features such as:
-- Highlighting of lines on the result screen
-- Classy display of the Query History and Saved Queries screens
-- A new navigation bar with drop-down menus
-- Standard queries are now triggered by the Enter/Return key
-- etc.

The new interface also works on small displays such as smartphones and tablet PCs. The main TS Corpus page now fits on displays as small as 3.2 inches.
Because the interface is built on CSS3, it is compatible with all modern browsers such as Firefox, Opera, Safari and Iceweasel.
We hope you like it.

TS Corpus is now on a New Server

Until now, TS Corpus was published on a limited infrastructure. Because of power cuts and ISP problems, TS Corpus reached only 92% uptime.

TS Corpus has now moved to a professional level. The corpus is hosted at OVH, the largest data center in Europe. TS Corpus now runs with a load-balanced database structure on a professional data-center server with a 100 Mbps internet connection and reliable, uninterrupted service.

With its new professional infrastructure, TS Corpus can now process your queries 40% faster and deliver results 5 times faster.

TS Corpus Turkish Frequency Lists

TS Corpus frequency lists have been published. For the first time in Turkish, the frequency lists include morphological analyses and the possible roots of the words.

Click here for TS Corpus frequency lists.
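
For readers who want to build a simple frequency list from their own WPL data, here is a minimal Python sketch. It only counts surface forms; it does not produce the morphological analyses or roots that the TS Corpus lists contain. The function names turkish_lower and frequency_list are made up for this example.

```python
from collections import Counter


def turkish_lower(word: str) -> str:
    """Lowercase with Turkish casing: I -> ı and İ -> i."""
    # Python's str.lower() is not locale-aware, so map the dotted and
    # dotless capital I by hand before lowercasing the rest.
    return word.replace("İ", "i").replace("I", "ı").lower()


def frequency_list(wpl_text: str):
    """Count tokens in WPL text (one token per line).

    Returns (token, count) pairs, most frequent first; ties are sorted
    alphabetically by Unicode code point.
    """
    tokens = [line.strip() for line in wpl_text.splitlines() if line.strip()]
    counts = Counter(turkish_lower(token) for token in tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for token, count in frequency_list("ev\nEv\nIşık\nev\n"):
        print(f"{token}\t{count}")
```

The custom lowercasing matters for Turkish because the default mapping would turn "I" into "i" rather than "ı", which would merge or split word forms incorrectly in the counts.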

TS Corpus Manual

TS Corpus Result Page Manual is now available.

Click here for details.