
Text Processing for PageRank and TrustRank Calculation

Python 3.6+ License: MIT

This Python project, developed as a MikeLab 2024 subproject, is designed for text processing, focusing on the computation of PageRank and TrustRank scores from JSON datasets using graph-based techniques.

Applications: suitable for tasks such as text summarization, term importance analysis, and content ranking.


Quick Start

  1. Ensure that Python (version 3.6 or later) and pip are installed on your system.

  2. Clone the repository and navigate to the directory:

git clone https://github.com/RJTPP/TextGraphRank.git &&
cd TextGraphRank
  3. Run the setup script. This will create a virtual environment and install the required dependencies.
python3 setup.py
  4. Activate the virtual environment.
  • For Linux/macOS
source venv/bin/activate
  • For Windows
venv\Scripts\activate
  5. Configure the project by editing config.json.

    • Set paths for datasets and outputs.
    • Define algorithm parameters and workflow options.
    • For more details, see the Configuration section.
  6. Add your dataset to the dataset/ directory.

    • Ensure the dataset is in JSON format and follows the structure specified in the Dataset Structure section.
  7. Run the main script. Optional arguments can be provided; see the Options section for details.

python3 main.py [OPTIONS]
  8. View the results in the output/ directory. See the Output section for an example.

Note

For further details, please refer to the Installation and Usage sections.

Requirements and Dependencies

This project was developed with Python 3.6, is tested for compatibility with Python 3.6 through 3.12, and should also work with newer versions. It requires the following dependencies:

  • networkx
  • matplotlib
  • scipy
  • regex
  • tqdm
  • nltk
  • pathlib
  • numpy
  • orjson

Note

For a complete list of dependencies, see requirements.txt

Installation

  1. Clone the repository.
git clone https://github.com/RJTPP/TextGraphRank.git &&
cd TextGraphRank
  2. Run setup.py. This will create a virtual environment and install the required dependencies.
python3 setup.py

Manual Installation (Optional)

If you prefer to install the dependencies manually, follow these steps:

  1. Create a virtual environment.
python3 -m venv venv
  2. Activate the virtual environment.
  • For Linux/macOS
source venv/bin/activate
  • For Windows
venv\Scripts\activate
  3. Upgrade pip.
pip install --upgrade pip
  4. Install the required dependencies.
pip install -r requirements.txt

Configuration

The project can be configured through config.json, which contains:

Path Configuration

  • cached_dir: Directory for storing cache data.
  • dataset_dir: Directory containing input JSON datasets.
  • output_dir: Directory where processed results and outputs will be stored.

Algorithm Parameters

  • calculation_threshold: Convergence threshold for iterative calculations like PageRank and TrustRank.
  • max_calculation_iteration: Maximum number of iterations for the scoring algorithms.
  • trustrank_bias_amount: Number of nodes to bias (seed) in TrustRank, chosen from the highest-scoring nodes of the inverse PageRank.
  • max_summarize_length: Maximum number of iterations for the TrustRank algorithm; this is also the maximum number of nodes or elements to summarize.

Workflow Options

  • use_pagerank_library: Set to true to use a library-based PageRank implementation (networkx) or false for the custom implementation.
  • output_graph: If true, saves the generated graphs as files in the output directory.
  • show_graph: If true, displays graphs during execution (requires a GUI).

Target Data Keys

  • target_data_key: Specifies which keys from the JSON dataset to process. See Dataset Structure for details.

Example Configuration

{
    "path": {
        "cached_dir"  : "caches",
        "dataset_dir" : "dataset",
        "output_dir"  : "output"
    },
    "parameters": {
        "calculation_threshold"     : 1e-5,
        "max_calculation_iteration" : 200,
        "trustrank_bias_amount"     : 1,
        "max_summarize_length"      : 20
    },
    "options": {
        "use_pagerank_library" : false,
        "output_graph"         : true,
        "show_graph"           : false
    },
    "target_data_key": [
        "target_key",
        "nested_target_key"
    ]
}

Tip

To print the currently configured paths in config.json, you can run the verify_path.py script.
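As a quick sanity check, the same values can also be inspected directly with Python's built-in json module. This is only an illustrative sketch and is not part of the project's scripts; it assumes config.json is in the current working directory:

import json

# Load the project configuration (assumed to be in the working directory).
with open("config.json", "r") as f:
    config = json.load(f)

print("Dataset directory:", config["path"]["dataset_dir"])
print("Output directory :", config["path"]["output_dir"])
print("Threshold        :", config["parameters"]["calculation_threshold"])
print("Use networkx     :", config["options"]["use_pagerank_library"])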

Dataset Structure

Input datasets must be JSON files structured as an array of dictionaries or an array of strings to ensure compatibility with the workflow. Below is an example of the expected dataset format:

Example 1: Array of Dictionaries

[
    {
        "id"  : "001",
        "data": {
            "full_text" : "This is a sample text.",
            "author"    : "author_name",
            "date"      : "2024-01-01"
        }
    },
    {
        "id"  : "002",
        "data": {
            "full_text" : "This is another sample text.",
            "author"    : "another_author_name",
            "date"      : "2024-01-02"
        }
    }
]
  • In this example, the data.full_text field contains the text to be processed.
  • target_data_key : ["data", "full_text"]

Example 2: Array of Strings

[
    "This is a sample text.",
    "This is another sample text."
]
  • target_data_key: Leave as an empty array [ ].
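To illustrate how target_data_key selects the text field in either case, here is a minimal sketch of the kind of nested-key lookup the workflow performs. The extract_text helper and the file name are hypothetical, not part of the project code:

import json

def extract_text(record, key_path):
    # Follow the list of keys (e.g. ["data", "full_text"]) into a nested dictionary.
    # An empty key_path means the record itself is the text (array-of-strings datasets).
    for key in key_path:
        record = record[key]
    return record

with open("dataset/example.json", "r") as f:  # hypothetical dataset file
    records = json.load(f)

texts = [extract_text(r, ["data", "full_text"]) for r in records]  # Example 1
# texts = [extract_text(r, []) for r in records]                   # Example 2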

Usage

To execute the project, follow these steps:

1. Add Dataset:

  • Ensure your dataset files are in the directory specified by the dataset_dir field in config.json. The default directory is dataset/.

2. Activate the virtual environment:

  • For Linux/macOS
source venv/bin/activate
  • For Windows
venv\Scripts\activate

3. Run the Main Script:

python3 main.py [OPTIONS]

Options:

  • -f, --files: Specify one or more JSON files to process.
python3 main.py -f file1.json file2.json
  • -e, --exclude: Exclude one or more JSON files from processing.
python3 main.py -e file1.json

Note

If no options are provided, the script processes all JSON files in the dataset directory by default.

Output

The project generates the following output files in the output/ directory (or the output directory configured in config.json).

  • graph_{name}.json: Represents the generated bigram graph. Each node is a bigram (two consecutive words), and each entry lists two connected bigrams together with the weight of the edge between them (frequency in the text).

Example:

[
  [
    "word1 word2",
    "word2 word3",
    2
  ]
]
  • inverse_pagerank_{name}.json and trust_rank_{name}.json: Contain ranking scores for terms or nodes.

Example:

[
  ["word1 word2", 0.7],
  ["word2 word3", 0.3]
]

Note

Higher scores indicate greater importance or relevance in the dataset.
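As an example of consuming these files, the sketch below loads one ranking output and prints the highest-scoring entries. The file name is a placeholder; {name} is replaced by the dataset file name:

import json

# Hypothetical output file name.
with open("output/trust_rank_example.json", "r") as f:
    scores = json.load(f)  # list of [term, score] pairs

# Sort by score and print the top five entries.
for term, score in sorted(scores, key=lambda pair: pair[1], reverse=True)[:5]:
    print(f"{score:.4f}  {term}")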

Workflow Overview

This project processes textual data in four stages:

1. Text Preprocessing

The text preprocessing module applies several cleaning techniques:

  • Converts text to lowercase
  • Expands contractions and replaces slang
  • Removes non-alphabetic characters, punctuation, and URLs
  • Removes stopwords
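A minimal sketch of this kind of cleaning, using the regex and nltk packages listed above. The exact rules applied by m_preprocess_text.py (e.g. contraction expansion and slang replacement) may differ:

import regex as re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                               # lowercase
    text = re.sub(r"https?://\S+", " ", text)         # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)             # keep alphabetic characters only
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords

print(preprocess("This is a sample text from https://example.com!"))
# ['sample', 'text']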

2. Bigram Graph Generation

  • Converts processed text into bigrams
  • Builds a weighted graph from the bigrams (library-based or custom implementation, depending on the configuration)
  • Optionally visualizes graphs using matplotlib
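For the library-based path, the construction can be pictured roughly as follows. This is a hedged sketch using networkx and collections.Counter, not the project's m_graph_nx.py:

import networkx as nx
from collections import Counter

tokens = ["sample", "text", "sample", "text", "graph"]

# Consecutive tokens form bigrams; consecutive bigrams form weighted edges.
bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
edge_weights = Counter(zip(bigrams, bigrams[1:]))

graph = nx.DiGraph()
for (src, dst), weight in edge_weights.items():
    graph.add_edge(src, dst, weight=weight)

print(list(graph.edges(data=True)))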

3. Score Calculation

  • Calculates PageRank using either the library-based (networkx) or the custom algorithm, depending on the configuration
  • Calculates TrustRank using a custom algorithm seeded with biased nodes (sketched below)
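With a networkx graph in hand, the library-based scoring step amounts to a call to nx.pagerank; a TrustRank-style score can be approximated with its personalization argument, which biases the random walk toward a small set of seed nodes. This is only a rough sketch under those assumptions, not the project's custom implementation:

import networkx as nx

graph = nx.DiGraph()
graph.add_weighted_edges_from([
    ("sample text", "text sample", 1),
    ("text sample", "sample text", 1),
    ("sample text", "text graph", 1),
])

# Inverse PageRank: rank the reversed graph.
inverse_pr = nx.pagerank(graph.reverse(), tol=1e-5, max_iter=200)

# Seed the bias with the top node(s) from inverse PageRank (trustrank_bias_amount = 1 here).
seeds = sorted(inverse_pr, key=inverse_pr.get, reverse=True)[:1]
personalization = {node: (1.0 if node in seeds else 0.0) for node in graph}
trust_scores = nx.pagerank(graph, personalization=personalization, tol=1e-5, max_iter=200)

print(sorted(trust_scores.items(), key=lambda kv: kv[1], reverse=True))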

4. Results Output

  • Processed results are saved to the output directory as defined in the configuration file.
  • The following calculations are saved:
    • Bigram Graph
    • Inverse PageRank
    • TrustRank

Project Directory Structure

work/
├── caches/
│   └── ... (Cache files)
│   
├── dataset/
│   └── ... (Put your dataset here)
│
├── helper_script/              # Utility scripts
│   ├── __init__.py
│   ├── file_reader_helper.py   # File-related helper functions
│   ├── json_helper.py          # JSON helper functions
│   └── func_timer.py           # Timer for monitoring function runtime
│
├── modules_script/             # Core processing modules
│   ├── __init__.py
│   ├── m_graph_custom.py       # Graph generation for calculating inverse PageRank (custom implementation)
│   ├── m_graph_nx.py           # Graph generation from bigrams (networkx library)
│   ├── m_preprocess_text.py    # Text preprocessing logic
│   └── m_process_text.py       # Text-to-bigrams logic
│
├── config.json                 # Configuration
├── main.py                     # Main script
├── requirements.txt      
├── settings.py                 # Handle settings from config.json 
├── setup.py                    # Setup script
├── verify_path.py              # Verify path settings
└── venv/                       # Virtual environment

License

This project is released under the MIT License.

You are free to use, modify, and distribute this software under the terms of the MIT License. See the LICENSE file for detailed terms and conditions.

Contributors

Rajata Thamcharoensatit (@RJTPP)
