
How to use Wikipedia’s full dump as corpus for text classification with NLTK

26.3.2013 | 1 minute of reading time

Wikipedia is not only a never-ending rabbit hole of information. You start with an article on a topic you want to know about, and hours later you end up at an article that has nothing to do with the topic you originally looked up. And all that time, you’ve just been clicking your way from one article to the next.

But from a different perspective, Wikipedia is probably the biggest crowd-sourced information platform, with a built-in review process and editions in as many languages as its users care to maintain (despite the fact that, together with Google, it has almost completely ousted printed encyclopaedias). So if this is not Big Data, then what is (pardon my sarcasm)?

And here is the most important part for this tiny post: Wikipedia comes with a more or less consistently maintained categorisation. Categories plus the article text are what a classifier in natural language processing (NLP) needs: the categories serve as classes and the text as training material. So I thought: why not use Wikipedia for text classification? I ended up implementing an NLP corpus based on Wikipedia’s full article dump, using groups of categories as classes and anti-classes. You can use it for whatever text you want to classify, as long as you follow Wikipedia’s terms of use and accept the categorisation and article quality. If you don’t, then, well, contribute and improve the quality like others do.
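To make the idea of classes and anti-classes a bit more concrete, here is a minimal sketch. It is not the wpcorpus API: the category groups, the tiny in-memory `ARTICLES` list, and all function names are made up for illustration, and a real setup would read articles and categories from the Wikipedia dump instead. The only NLTK calls used are `nltk.NaiveBayesClassifier.train` and `classify`.

```python
import re

# Hypothetical category groups; real Wikipedia category names would go here.
CLASS_CATEGORIES = {"Programming languages", "Software engineering"}
ANTI_CLASS_CATEGORIES = {"Cooking", "Gardening"}


def label_for(categories):
    """Map an article's categories to 'class', 'anti-class', or None."""
    cats = set(categories)
    if cats & CLASS_CATEGORIES:
        return "class"  # the class group wins if an article is in both
    if cats & ANTI_CLASS_CATEGORIES:
        return "anti-class"
    return None  # article belongs to neither group and is ignored


def features(text):
    """Word-presence features in the dict form NLTK classifiers accept."""
    words = re.findall(r"[a-z]+", text.lower())
    return {"contains(%s)" % word: True for word in words}


def build_featuresets(articles):
    """Turn (text, categories) pairs into (features, label) pairs."""
    featuresets = []
    for text, categories in articles:
        label = label_for(categories)
        if label is not None:
            featuresets.append((features(text), label))
    return featuresets


def train(featuresets):
    """Train a Naive Bayes classifier on the labelled feature sets."""
    import nltk  # imported here so the data preparation works without NLTK

    return nltk.NaiveBayesClassifier.train(featuresets)


# Toy stand-in for articles extracted from the dump (invented data).
ARTICLES = [
    ("A compiler translates source code into machine code",
     ["Programming languages"]),
    ("Unit tests and code review improve software quality",
     ["Software engineering"]),
    ("Simmer the sauce and season the soup with salt", ["Cooking"]),
    ("Water the tomato plants and prune the roses", ["Gardening"]),
    ("The treaty ended the war in the seventeenth century", ["History"]),
]

if __name__ == "__main__":
    classifier = train(build_featuresets(ARTICLES))
    print(classifier.classify(features("the compiler rejected my source code")))
```

With five toy sentences the classifier is of course meaningless; the point is only the shape of the data: each article becomes one feature dict labelled by the category group it falls into, and articles outside both groups are skipped.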

The whole code, including step-by-step usage instructions, is out on GitHub: https://github.com/pavlobaron/wpcorpus . Any constructive feedback and help are welcome.
