The source code used for Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding, published in KDD 2020. The code structure (especially file reading and saving functions) is adapted from the Word2Vec implementation.
- GCC compiler (used to compile the source C file): see the guide for installing GCC.
We provide two example datasets, the New York Times annotated corpus and the arXiv abstract corpus, which are used in the paper. We also provide a shell script, run.sh, for compiling the source code and performing topic mining on the two example datasets. You should be able to obtain results similar to those reported in the paper.
You will need to first create a directory under datasets (e.g., datasets/your_dataset) and put three files in it:
- A text file of the corpus, e.g., datasets/your_dataset/text.txt. Note: When preparing the text corpus, make sure each line in the file is one document/paragraph.
- A text file with the category names/keywords for each category, e.g., datasets/your_dataset/category_names.txt, where each line contains the category id (starting from 0) and the seed words for the category. You can provide any number of seed words in each line (at least 1 per category; if there are multiple seed words, separate them with whitespace characters). Note: You need to ensure that every provided seed word appears in the vocabulary of the corpus.
- A category taxonomy file with the category structure, e.g., datasets/your_dataset/taxonomy.txt, where each line contains two category ids separated by a whitespace character. The former category is the parent category of the latter category. Note: You need to ensure that the category ids used in the taxonomy file are consistent with those in the category name file.
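The three notes above can be checked before training. The following is a minimal sketch, not part of the repository: the function names are hypothetical, "consistent" is assumed to mean that every id in the taxonomy file also appears in the category name file, and the corpus vocabulary is approximated by whitespace-split tokens. The min_count parameter is an assumption added because the -min-count option discards rare words; whether that filtering also affects seed words is not stated here.

```python
from collections import Counter
from pathlib import Path


def read_category_names(path):
    """Parse lines of the form '<category id> <seed word> [<seed word> ...]'."""
    seeds = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if parts:
            seeds[int(parts[0])] = parts[1:]
    return seeds


def read_taxonomy(path):
    """Parse lines of the form '<parent id> <child id>'."""
    edges = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if parts:
            edges.append((int(parts[0]), int(parts[1])))
    return edges


def check_dataset(corpus_path, category_path, taxonomy_path, min_count=1):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []

    # Approximate the vocabulary by counting whitespace-separated tokens.
    vocab = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:  # one document/paragraph per line
            vocab.update(line.split())

    seeds = read_category_names(category_path)
    for cat_id, words in sorted(seeds.items()):
        if not words:
            problems.append(f"category {cat_id} has no seed word")
        for word in words:
            if vocab[word] < min_count:
                problems.append(
                    f"seed word '{word}' (category {cat_id}) occurs "
                    f"{vocab[word]} time(s) in the corpus"
                )

    # Assumed meaning of "consistent": taxonomy ids must exist in the category file.
    for parent, child in read_taxonomy(taxonomy_path):
        for node in (parent, child):
            if node not in seeds:
                problems.append(f"taxonomy id {node} is missing from the category file")
    return problems
```

Calling check_dataset("datasets/your_dataset/text.txt", "datasets/your_dataset/category_names.txt", "datasets/your_dataset/taxonomy.txt") returns an empty list when no problem is detected under these assumptions.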
- You can use any tool to preprocess the corpus (e.g., tokenization, lowercasing). If you do not have a specific preference, you can use our provided preprocessing tool: add your corpus directory to auto_phrase.sh and run it. The script assumes that the raw corpus is named text.txt, and will generate a phrase-segmented, lowercased corpus named phrase_text.txt under the same directory.
- You need to run src/read_taxo.py to generate two taxonomy information files: matrix_taxonomy.txt, which represents the taxonomy in matrix form, and level_taxonomy.txt, which records the node level information. See run.sh for an example of using src/read_taxo.py to generate these two files.
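To illustrate what "matrix form" and "node level" can mean for a taxonomy, here is a small sketch that derives both from parent/child id pairs. It is not the code of src/read_taxo.py and does not reproduce the file formats that script writes; the function names are hypothetical, and placing root nodes at level 0 is an assumption.

```python
from collections import defaultdict, deque


def adjacency_matrix(edges):
    """Return a square 0/1 matrix where matrix[parent][child] == 1."""
    size = max(max(parent, child) for parent, child in edges) + 1
    matrix = [[0] * size for _ in range(size)]
    for parent, child in edges:
        matrix[parent][child] = 1
    return matrix


def node_levels(edges):
    """Return {node id: depth}, with parentless nodes at depth 0 (assumed)."""
    children = defaultdict(list)
    nodes, has_parent = set(), set()
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)
        nodes.update((parent, child))

    roots = sorted(nodes - has_parent)
    levels = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:  # breadth-first traversal from the roots
        node = queue.popleft()
        for child in children[node]:
            if child not in levels:
                levels[child] = levels[node] + 1
                queue.append(child)
    return levels
```

For the pairs (0, 1), (0, 2), (1, 3), category 0 is the root, categories 1 and 2 sit one level below it, and category 3 sits two levels below it.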
We provide a 100-dimensional pretrained JoSE embedding, jose_100.zip. You can also use other pretrained embeddings (use the -load-emb argument to specify the pretrained embedding file). Pretrained embedding is optional (omit the -load-emb argument if you do not use pretrained embedding), but generally will result in better embedding initialization and higher-quality topic mining results.
Invoke the command without arguments for a list of parameters and their meanings:
$ ./src/josh
Parameters:
########## Input/Output: ##########
-train <file> (mandatory argument)
Use text data from <file> to train the model
-category-file <file>
Use <file> to provide the topic names/keywords
-matrix-file <file>
Use <file> to provide the taxonomy file in matrix form; generated by read_taxo.py
-level-file <file>
Use <file> to provide the node level information file; generated by read_taxo.py
-res <file>
Use <file> to save the hierarchical topic mining results
-k <int>
Set the number of terms per topic in the output file; default is 10
-word-emb <file>
Use <file> to save the resulting word embeddings
-tree-emb <file>
Use <file> to save the resulting category embeddings
-load-emb <file>
The pretrained embeddings will be read from <file>
-binary <int>
Save the resulting vectors in binary moded; default is 0 (off)
-save-vocab <file>
The vocabulary will be saved to <file>
-read-vocab <file>
The vocabulary will be read from <file>, not constructed from the training data
########## Embedding Training: ##########
-size <int>
Set dimension of text embeddings; default is 100
-iter <int>
Set the number of iterations to train on the corpus (performing topic mining); default is 5
-pretrain <int>
Set the number of iterations to pretrain on the corpus (without performing topic mining); default is 2
-expand <int>
Set the number of terms to be added per topic per iteration; default is 1
-window <int>
Set max skip length between words; default is 5
-word-margin <float>
Set the word embedding learning margin; default is 0.25
-cat-margin <float>
Set the intra-category coherence margin m_intra; default is 0.9
-sample <float>
Set threshold for occurrence of words. Those that appear with higher frequency in the training data
will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
-negative <int>
Number of negative examples; default is 2, common values are 3 - 5 (0 = not used)
-threads <int>
Use <int> threads (default 12)
-min-count <int>
This will discard words that appear less than <int> times; default is 5
-alpha <float>
Set the starting learning rate; default is 0.025
-debug <int>
Set the debug mode (default = 2 = more info during training)
See run.sh for an example of how to set the arguments.
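As a rough illustration of how the input/output arguments fit together, the sketch below assembles a command line from the options listed above. It is not taken from run.sh: the helper name, the res.txt output name, and the assumption that the two taxonomy information files live in the dataset directory are all hypothetical. Only flags that appear in the parameter list are used.

```python
import shlex


def build_josh_command(dataset_dir, text_file="phrase_text.txt",
                       pretrained=None, top_k=10, threads=12):
    """Assemble an argument list for ./src/josh (paths are illustrative)."""
    cmd = [
        "./src/josh",
        "-train", f"{dataset_dir}/{text_file}",
        "-category-file", f"{dataset_dir}/category_names.txt",
        # Assumed location of the files generated by read_taxo.py.
        "-matrix-file", f"{dataset_dir}/matrix_taxonomy.txt",
        "-level-file", f"{dataset_dir}/level_taxonomy.txt",
        # Hypothetical output file name for the mined topics.
        "-res", f"{dataset_dir}/res.txt",
        "-k", str(top_k),
        "-threads", str(threads),
    ]
    if pretrained is not None:
        # Optional: omit -load-emb when no pretrained embedding is used.
        cmd += ["-load-emb", pretrained]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_josh_command("datasets/your_dataset")))
```

The returned list can be passed to subprocess.run once the binary has been compiled and the paths have been adjusted to your setup.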
Please cite the following paper if you find the code helpful for your research.
@inproceedings{meng2020hierarchical,
title={Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding},
author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Zhang, Yu and Zhang, Chao and Han, Jiawei},
booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
year={2020}
}