A few weeks ago, I made a microloan to political management student Yaqout from Al Hashmiya, Jordan. Twenty-three other lenders from around the globe did the same, reaching a total of $725. Yaqout will use the money to finance her next semester.
The loan was facilitated by Kiva, a microfinance platform that has been around since 2005. In ten years, hundreds of thousands of borrowers like Yaqout have supported their loan application with a personal story describing their ambitions and plans.
Over the years, Kiva has meticulously collected all these stories. What's more, it has made them available for further application development and research. Perfect for a data science project!
Topic modelling as a subgoal of natural language processing
Over the past two and a half weeks, the Metis Data Science Bootcamp introduced us to Natural Language Processing (NLP, not to be confused with Neuro-Linguistic Programming). One subgoal of NLP is to make it easier for humans to process and interpret huge amounts of text data.
While most earthlings nowadays routinely use text search (a.k.a. information retrieval), fewer are accustomed to getting insight, even an impressionistic one, into the contents of massive data collections. Yet over the last 20 years, technologies such as latent semantic indexing and, more recently, probabilistic topic models have been developed to address precisely this need.
Kiva borrower stories at a glance
With the help of gensim, an excellent Python library for topic modelling, we extracted 64 topics from a corpus of 775,000 borrower stories. Please refer to our IPython notebook and GitHub repository for details.
The zoomable sunburst graph below offers a bird's-eye view of the Kiva borrower stories. Each of the 64 outer segments represents one core topic, which is expressed as a probability distribution over a vocabulary of 1000 words (excluding stop words and other very frequent words). Of these 1000 words, only the seven most prominent ones are shown for each topic. The inner segments represent successively more abstract groupings of the core topics.
The breadcrumb arrow at the top left-hand side shows the most prominent words along the path you have visited, as you traverse the hierarchy from the central circle to the outer rings and back.
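To illustrate what "a probability distribution over a vocabulary" means in practice, the sketch below represents one topic as a plain dictionary and picks its seven most prominent words. The words and probabilities are made up for illustration, and the vocabulary has nine entries rather than 1000.

```python
# Hypothetical topic: each vocabulary word gets a probability,
# and the probabilities sum to 1.
topic = {
    "school": 0.22,
    "student": 0.18,
    "fees": 0.15,
    "university": 0.12,
    "semester": 0.10,
    "tuition": 0.08,
    "education": 0.06,
    "books": 0.05,
    "exam": 0.04,
}


def most_prominent(distribution, n=7):
    """Return the n words with the highest probability, most probable first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


label_words = most_prominent(topic)
print(label_words)
```

The remaining words still belong to the topic; they are just too improbable to be displayed in the segment label.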
See Hierarchie intro
The size of each segment represents the presence of its topic in the whole collection, i.e. across all borrower stories. In probabilistic topic modelling, each document is seen as having been generated by probabilistic draws from all topics in the data collection - with some topics being more prominent than others. In other words: no single document has an exclusive relationship to a single topic, and vice versa.
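One plausible way to turn per-document topic mixtures into segment sizes is to average each topic's share over all documents. The sketch below assumes exactly that; the graph may have been built with a different weighting (for example by document length), and the numbers are invented.

```python
# Hypothetical topic mixtures for four stories and three topics.
# Each row sums to 1: a story is a blend of topics, not a member of one.
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.1, 0.4],
]


def topic_prevalence(mixtures):
    """Average each topic's share across all documents."""
    n_docs = len(mixtures)
    n_topics = len(mixtures[0])
    return [
        sum(doc[topic_id] for doc in mixtures) / n_docs
        for topic_id in range(n_topics)
    ]


prevalence = topic_prevalence(doc_topics)
print(prevalence)
```

Because every row sums to 1, the averaged shares also sum to 1, so they can be read directly as relative segment sizes.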
The colors have no other purpose than to serve visual distinctiveness; they do not carry any meaning as such.