In [None]:
!pip install mlxtend

Our first step is to download an article from Wikipedia and extract its paragraphs.

In [None]:
from bs4 import BeautifulSoup
import requests

respond = requests.get("https://en.wikipedia.org/wiki/Poznań")
soup = BeautifulSoup(respond.text, "lxml")
page = soup.find_all('p')

raw_text = [paragraph.text for paragraph in page]

print(raw_text)

Next, we will split each paragraph into words and drop the paragraphs with fewer than 3 words.

In [None]:
# tokenize on whitespace; keep only paragraphs with at least 3 words
text = [line.split() for line in raw_text if len(line.split()) > 2]

print(text)

Our text still contains many stop words and stray tokens such as 1.2, [2], etc. We will use the `nltk` library to remove the stop words, and we will keep only purely alphabetic tokens, converted to lowercase.

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

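# build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

clean_text = [
    [
        word.lower()
        for word in line
        if word.isalpha() and word.lower() not in stop_words
    ]
    for line in text
]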

clean_text
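
One side effect of `str.isalpha()` is worth keeping in mind: tokens with attached punctuation, such as `Poland.`, are dropped along with `[2]` and `1.2`. The sketch below illustrates this on a made-up token list (the sample tokens are an assumption, not taken from the downloaded page) and shows one possible workaround, stripping surrounding punctuation before the test. Whether that variant is preferable depends on the analysis.

```python
import string

# Hypothetical sample tokens, not taken from the downloaded page
tokens = ["Poznań", "is", "a", "city", "in", "Poland.", "[2]", "1.2"]

# The filter used above: any token containing a non-letter character is dropped
kept = [t.lower() for t in tokens if t.isalpha()]
print(kept)

# Possible variant: strip leading/trailing punctuation before testing
stripped = [t.strip(string.punctuation) for t in tokens]
kept_stripped = [t.lower() for t in stripped if t.isalpha()]
print(kept_stripped)
```

With the variant, `Poland.` survives as `poland`, while `[2]` and `1.2` are still removed.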

Now we are ready to transform the list of lists into a format suitable for association rule mining, i.e., to encode each paragraph as a row of boolean flags, one flag per distinct word.

In [None]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

te = TransactionEncoder()
te_array = te.fit(clean_text).transform(clean_text)

In [None]:
# te_array contains binary version of the input data

te_array

In [None]:
te_array.shape

In [None]:
# original tokens are preserved in the columns_ field

te.columns_
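
To make the encoding concrete, here is a minimal plain-Python sketch of the same idea on a toy input. It is not `mlxtend`'s implementation; the toy transactions and the variable names `columns` and `rows` are assumptions made for illustration.

```python
# Toy transactions (hypothetical), each one a list of tokens
transactions = [
    ["city", "poland"],
    ["city", "river"],
    ["poland", "capital", "city"],
]

# One column per distinct token, in sorted order
columns = sorted({item for t in transactions for item in t})

# One row per transaction: True where the token occurs in it
rows = [[item in t for item in columns] for t in transactions]

print(columns)
for row in rows:
    print(row)
```

Each row has as many flags as there are distinct tokens, which is why `te_array.shape` above is (number of paragraphs, vocabulary size).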

The `mlxtend` package assumes that the input data are stored as a `pandas.DataFrame`.

In [None]:
df = pd.DataFrame(te_array, columns=te.columns_)

df.head()

Now we are ready to find frequent collections of words.

In [None]:
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)

frequent_itemsets
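
The `min_support` threshold refers to the fraction of transactions (here, paragraphs) that contain every item of an itemset. A minimal sketch of that computation, on hypothetical toy transactions and with a helper named `support` introduced only for illustration:

```python
# Toy transactions (hypothetical), each one a set of tokens
transactions = [
    {"city", "poland"},
    {"city", "river"},
    {"poland", "capital", "city"},
    {"river", "warta"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"city"}, transactions))
print(support({"city", "poland"}, transactions))
```

With `min_support=0.5`, both itemsets in this toy example would count as frequent, while `{"river", "warta"}` (support 0.25) would not.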

We can also mine association rules, which come with additional measures of quality and interestingness.

In [None]:
from mlxtend.frequent_patterns import association_rules
??association_rules

In [None]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, 
 metric='confidence', 
 min_threshold=0.7)

In [None]:
association_rules(frequent_itemsets, metric='lift', min_threshold=5.0)
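
The two metrics used above follow the standard definitions: confidence(A → C) = support(A ∪ C) / support(A), and lift(A → C) = confidence(A → C) / support(C). The sketch below computes both by hand on hypothetical toy transactions; the helpers `support`, `confidence` and `lift` are illustrative names, not part of `mlxtend`.

```python
# Toy transactions (hypothetical), each one a set of tokens
transactions = [
    {"city", "poland"},
    {"city", "river"},
    {"poland", "capital", "city"},
    {"river", "warta"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent occurs in transactions containing the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

print(confidence({"poland"}, {"city"}))
print(lift({"poland"}, {"city"}))
```

A lift above 1 indicates that the antecedent and consequent co-occur more often than they would if they were independent.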

Both frequent itemsets and association rules (antecedents and consequents) are returned as `frozenset`s, so we can use [standard API calls](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) to find subsets, supersets, etc.

In [None]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

capital_idx = rules['antecedents'].apply(lambda x: x.issuperset({'capital','polish'}))
rules[capital_idx]
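
For reference, a short sketch of the `frozenset` operations that are handy in such filters. The `antecedent` value here is made up for illustration and does not come from the mined rules.

```python
# Hypothetical antecedent, standing in for one cell of rules['antecedents']
antecedent = frozenset({"capital", "polish", "city"})

# Does the antecedent contain both words?
has_both = antecedent.issuperset({"capital", "polish"})

# Is the antecedent made up only of words from an allowed vocabulary?
only_allowed = antecedent.issubset({"capital", "polish", "city", "river"})

# Does the antecedent avoid a given word entirely?
no_river = antecedent.isdisjoint({"river"})

# Which of the given words appear in the antecedent?
common = antecedent & {"city", "river"}

print(has_both, only_allowed, no_river, common)
```

Any of these predicates can be passed to `Series.apply` in the same way as the `issuperset` call above.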