from helper import *
# ! pip install pandas nltk gensim pyldavis
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
An LDA model boils down to four quantities: the article-topic distribution (`article_topic`, how much of each topic a document contains), the topic-term distribution (`topic_term`, which words define each topic), and the Dirichlet priors `alpha` (on topics per document) and `eta` (on terms per topic).
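These quantities map directly onto gensim's `LdaModel` API: `alpha` and `eta` are keyword arguments (or can be learned with `'auto'`), the article-topic distribution comes from `get_document_topics`, and the topic-term matrix from `get_topics`. A minimal sketch, assuming the `doc_term_matrix` and `dictionary` built further down in this notebook:
from gensim.models import LdaModel
# Sketch only: `doc_term_matrix` and `dictionary` are created in the cells below.
model = LdaModel(doc_term_matrix, num_topics=10, id2word=dictionary,
                 alpha='auto', eta='auto')                      # Dirichlet priors
article_topic = model.get_document_topics(doc_term_matrix[0])   # per-document topic mixture
topic_term = model.get_topics()                                 # (num_topics x num_terms) probabilities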
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
lemmatize = WordNetLemmatizer()

def cleaning(article):
    # Lowercase and drop stopwords, strip punctuation, then lemmatize each token.
    one = " ".join([i for i in article.lower().split() if i not in stop_words])
    two = "".join(i for i in one if i not in punctuation)
    three = " ".join(lemmatize.lemmatize(i) for i in two.split())
    return three
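As a quick sanity check, `cleaning` can be run on a single sentence before applying it to the whole corpus (the sentence below is made up for illustration):
# Stopwords and punctuation are dropped, plurals are lemmatized.
print(cleaning('The movies were surprisingly engaging, with cats chasing dogs!'))
# roughly: 'movie surprisingly engaging cat chasing dog'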
Data preparation couldn't be simpler: all you need is a list of documents.
The shorter each document is, the less time the topic model takes to train.
import pandas as pd
df = pd.read_table('plot.tok.gt9.5000', names=['text'])
df.info()
df.head(3)
# Clean every plot summary
text = df['text'].apply(cleaning)
text_list = [i.split() for i in text]
len(text_list)
text_list[0]
from time import time
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO,
filename='running.log',filemode='w')
# Importing Gensim
import gensim
from gensim import corpora
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(text_list)
dictionary.save('dictionary.dict')
print(dictionary)
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in text_list]
corpora.MmCorpus.serialize('corpus.mm', doc_term_matrix)
print(len(doc_term_matrix))
print(doc_term_matrix[100])
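Each row of the document-term matrix is a bag-of-words: a list of `(token_id, count)` pairs. To make a row human-readable, the ids can be mapped back through the dictionary (a small sketch, using the same example row as above):
# Translate (token_id, count) pairs back into (word, count) pairs.
readable = [(dictionary[token_id], count) for token_id, count in doc_term_matrix[100]]
print(readable)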
start = time()
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50)
print('used: {:.2f}s'.format(time() - start))
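With `passes=50` this step can take a while. If it is too slow, gensim also ships a parallel implementation, `LdaMulticore`; a minimal sketch, assuming a machine with a few spare cores (the worker count here is arbitrary):
from gensim.models import LdaMulticore
# Same corpus and dictionary, but training is spread over several worker processes.
fast_model = LdaMulticore(doc_term_matrix, num_topics=10, id2word=dictionary,
                          passes=50, workers=3)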
print(ldamodel.print_topics(num_topics=2, num_words=4))
for i in ldamodel.print_topics():
    for j in i:
        print(j)
ldamodel.save('topic.model')
from gensim.models import LdaModel
loading = LdaModel.load('topic.model')
print(loading.print_topics(num_topics=2, num_words=4))
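`print_topics` returns formatted strings; to work with the topics programmatically as `(word, probability)` pairs instead, `show_topics` with `formatted=False` can be used (a small sketch):
# Each topic comes back as (topic_id, [(word, probability), ...]).
for topic_id, terms in loading.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, terms)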
def pre_new(doc):
    # Clean a new document and convert it to the bag-of-words format the model expects.
    one = cleaning(doc).split()
    two = dictionary.doc2bow(one)
    return two
pre_new('new article that to be classified by trained model!')
belong = loading[(pre_new('new article that to be classified by trained model!'))]
belong
new = pd.DataFrame(belong,columns=['id','prob']).sort_values('prob',ascending=False)
new['topic'] = new['id'].apply(loading.print_topic)
new
new['topic']
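The same steps can be wrapped into a small helper that returns the single most likely topic for a new article (a sketch; `classify_new` is a name introduced here, not part of the original notebook):
def classify_new(doc, model=loading):
    # Score the cleaned document and keep the (topic_id, probability) pair with the highest probability.
    scores = model[pre_new(doc)]
    return max(scores, key=lambda pair: pair[1])

classify_new('new article that to be classified by trained model!')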
import pyLDAvis.gensim
import gensim
pyLDAvis.enable_notebook()
d = gensim.corpora.Dictionary.load('dictionary.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')
data = pyLDAvis.gensim.prepare(lda, c, d)
data
pyLDAvis.save_html(data,'vis.html')
# %%HTML
# <iframe width="100%" height="500" src="http://www.jishichao.com/vis"></iframe>