PTT Corpus

Prerequisite

You have to be a registered user.

Limitations

Maximum requests: 20 per minute

If you exceed this limit, you will get HTTP Error 429: TOO MANY REQUESTS.
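
One way to stay under the limit is to space requests out on the client side. The sketch below is not part of the API: the class name MinIntervalThrottle and its parameters are made up for illustration, and it assumes that spacing calls evenly (one every 3 seconds) is enough, since how the server counts its one-minute window is not documented here.

```python
import time


class MinIntervalThrottle:
    """Keep consecutive calls at least `interval` seconds apart."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        # 20 requests per minute means one request every 3 seconds.
        self.interval = 60.0 / max_per_minute
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling wait() right before each request keeps a loop of requests at or below 20 per minute. A 429 response can still occur if other programs use the same account.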

Authentication

The examples here use Python 2 and its urllib2 module.

1. First, install a urllib2 opener that sends your credentials with each request:


(Code adapted from here)

    import urllib2, base64

    class PreemptiveBasicAuthHandler(urllib2.HTTPBasicAuthHandler):
        # Send the Basic auth header with the first request instead of
        # waiting for a 401 challenge from the server.
        def http_request(self, req):
            url = req.get_full_url()
            realm = None
            user, pw = self.passwd.find_user_password(realm, url)
            if pw:
                raw = "%s:%s" % (user, pw)
                auth = 'Basic %s' % base64.b64encode(raw).strip()
                req.add_unredirected_header(self.auth_header, auth)
            return req

        https_request = http_request

    api_url = "http://lopen.linguistics.ntu.edu.tw/PTT/api/"
    username = "foo"  # replace with your own username
    password = "bar"  # replace with your own password

    auth_handler = PreemptiveBasicAuthHandler()
    auth_handler.add_password(realm=None, uri=api_url, user=username, passwd=password)
    opener = urllib2.build_opener(auth_handler)
    urllib2.install_opener(opener)

You can then call urllib2.urlopen() on any API URL under api_url.


To test whether you're authenticated, try this:

    urllib2.urlopen('http://lopen.linguistics.ntu.edu.tw/PTT/api/test').read()

If authenticated, you'll get {'status':'ok'};
if not, HTTPError: HTTP Error 403: FORBIDDEN will be raised.
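
In Python 3, urllib2 no longer exists; its functionality lives in urllib.request. The following is a minimal sketch of the same preemptive Basic authentication there. The helper names basic_auth_header and build_request are made up for illustration, "foo" and "bar" are the placeholder credentials used in the Authentication section, and the sketch has not been checked against the live service.

```python
import base64
import urllib.request

API_URL = "http://lopen.linguistics.ntu.edu.tw/PTT/api/"


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(url, username, password):
    """Create a request that carries the credentials up front."""
    req = urllib.request.Request(url)
    # Unredirected: the header is not forwarded if the server redirects elsewhere.
    req.add_unredirected_header("Authorization", basic_auth_header(username, password))
    return req


# Placeholder credentials; replace with your own account.
request = build_request(API_URL + "test", "foo", "bar")
# urllib.request.urlopen(request).read() would perform the actual call.
```

Building the request is kept separate from sending it, so the header can be inspected without touching the network.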

Services

  • Authentication test

    • Description: Test authentication status.
    • Usage: /PTT/api/test
  • Article

    • Description: Get articles (posts; "po文" in PTT usage).
    • Usage: /PTT/api/article/BOARD/from/START_DATE/to/END_DATE
      Date format: yyyy-mm-dd
    • Limitation: The date range cannot exceed 180 days.
      For example, if START_DATE is 2001-01-01, END_DATE cannot be later than 2001-06-30.
    • Example: Get all posts from the joke board (就可版) from 2014-01-01 to 2014-05-01:
      /PTT/api/article/joke/from/2014-01-01/to/2014-05-01
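
The article path and the 180-day limit can be handled with a little date arithmetic. The sketch below builds the path for one request and splits a longer period into windows that each respect the limit. The function names article_path and split_range are made up for illustration. Reading the limit as "END_DATE minus START_DATE is at most 180 days" follows the 2001-01-01 to 2001-06-30 example; whether the service treats END_DATE as inclusive is an assumption.

```python
from datetime import date, timedelta

MAX_SPAN_DAYS = 180  # limit stated in the Article service description


def article_path(board, start, end):
    """Return the API path for articles on `board` between two dates."""
    if end < start:
        raise ValueError("END_DATE is earlier than START_DATE")
    if (end - start).days > MAX_SPAN_DAYS:
        raise ValueError("date range exceeds %d days" % MAX_SPAN_DAYS)
    # date.isoformat() gives yyyy-mm-dd, the format the API expects.
    return "/PTT/api/article/{}/from/{}/to/{}".format(
        board, start.isoformat(), end.isoformat())


def split_range(start, end, max_days=MAX_SPAN_DAYS):
    """Split [start, end] into consecutive windows of at most `max_days` days."""
    windows = []
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=max_days), end)
        windows.append((current, window_end))
        # Assumes END_DATE is inclusive, so the next window starts a day later.
        current = window_end + timedelta(days=1)
    return windows


# One path per window for a full year on the joke board.
paths = [article_path("joke", s, e)
         for s, e in split_range(date(2001, 1, 1), date(2001, 12, 31))]
```

Each resulting path is one request, so a long period still counts against the 20 requests per minute limit.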