Skip to content

Metadata Database Schema

Chushu Gao edited this page Nov 24, 2021 · 41 revisions

This page will explain the schema that is used to store the metadata of the callgraphs.

Database

We decided to use PostgreSQL version 12 or higher because it has some very useful features that we utilize (like extensive JSON data type support).

Schema

Depicted below is the E/R diagram of the metadata database schema we have devised: FASTEN Metadata Database Schema

Explanation

  • The packages table contains basic information about the packages/products like the name of the package, forge, repository URL and etc.
  • Each package can have multiple versions, thus the table package_versions references packages and stores information like version of the package.
  • Each package version can have multiple dependencies to other packages. Therefore table dependencies references package_versions.id as source package and packages.id as a dependent package and has a version_range which specifies acceptable versions of dependency.
  • Each package_version can have multiple modules in them. For some languages, these modules can be viewed as classes. Thus the table modules references package_versions and stores such attributes as the ID of the name of the module that has been previously added to the module_names table,superClasses and superInterfaces, whether or not module has been declared final, access type of the module, annotations, and metadata about the module in form of JSON.
  • Each module can consist of multiple methods - callables. Therefore, table callables stores reference to modules, FASTEN URI of the method, boolean value if this callable is internal or external, line numbers of the start and the end of the method, timestamp and metadata. However, for external callable, a reference to modules is -1 which is a default module for all external callables. Thus, when saving the graph, all external callables are inserted in the callables table with -1 as a reference to modules and is_internal_call set to false.
  • The table call-sites stores the graph's edges. Each call site has a source and target which both reference callables table allowing to store both internal and external calls. Each call site also stores call type, call line number and ID of receiver type that has been previously saved to the module_names table, and metadata.
  • Each module can also contain multiple files and one file can contain multiple modules. Therefore this many-to-many relationship is realized by means of module_contents table.
  • Each package version also has binary modules for some of the languages. binary_modules are very similar to module, but they are supposed to store data for binary modules and link to binary files.
  • Similarly to modules, binary_modules has many-to-many relationship realized by the binary_module_contents table.
  • In C programming language programs can have virtual implementations and in order to facilitate this relationship, there is a table virtual_implemetations that describes a many-to-many relationship between the package_verions table and itself.
  • ingested_artifacts table contains artifacts that have been ingested using REST API's lazy ingestion
  • artifact_repositories table holds the base URLs to the artifact repositories used to download the artifacts such as Maven Central.

Additional metadata

From POM Analyzer

POM Analyzer extracts a list of dependencies of an artifact, dependencyManagement of that artifact, repository URL, and commit tag, packaging type, project name if they are present, and generates sources URL. Each dependency is put into the dependencies table with detailed metadata. Here is an example of dependency's metadata:

{
   "type":"",
   "scope":"compile",
   "groupId":"com.yahoo.athenz",
   "optional":false,
   "artifactId":"athenz-zms-java-client",
   "classifier":"",
   "exclusions":[
      {
         "groupId":"org.slf4j",
         "artifactId":"slf4j-api"
      },
      {
         "groupId":"org.glassfish.jersey.core",
         "artifactId":"jersey-client"
      },
      {
         "groupId":"org.glassfish.jersey.media",
         "artifactId":"jersey-media-json-jackson"
      },
      {
         "groupId":"com.fasterxml.jackson.core",
         "artifactId":"jackson-core"
      },
      {
         "groupId":"com.fasterxml.jackson.core",
         "artifactId":"jackson-databind"
      },
      {
         "groupId":"com.fasterxml.jackson.core",
         "artifactId":"jackson-annotations"
      }
   ],
   "versionConstraints":[
      {
         "lowerBound":"1.7.28",
         "upperBound":"1.7.28",
         "isLowerHardRequirement":false,
         "isUpperHardRequirement":false
      }
   ]
}

The dependencyManagement information is put into the package_versions.metadata along with commit tag (a tag in the repository which should signify certain version), link to Maven sources JAR file, and packaging type of the artifact. Here is an example of a package version's metadata:

{
   "commitTag":"2.5.0",
   "sourcesUrl":"https://repo.maven.apache.org/maven2/junit/junit/4.12/junit-4.12-sources.jar",
   "packagingType":"jar",
   "parentCoordinate": "",
   "dependencyManagement":{
      "dependencies":[
         {
            "type":"",
            "scope":"",
            "groupId":"com.yahoo.elide",
            "optional":false,
            "artifactId":"elide-core",
            "classifier":"",
            "exclusions":[

            ],
            "versionConstraints":[
               {
                  "lowerBound":"2.5.0",
                  "upperBound":"2.5.0",
                  "isLowerHardRequirement":false,
                  "isUpperHardRequirement":false
               }
            ]
         }
      ]
   }
}

From Metadata Plugin

Metadata plugin inserts metadata produced by the call graph generator and inserts this metadata into the callables, and modules tables. Here is an example of module's metadata:

{
    "access": "public",
    "final": false,
    "superInterfaces": [],
    "superClasses": ["/java.lang/Object"]
}

Here is an example of callable's metadata:

{
    "access": "public",
    "defined": true
}

More detailed information about this metadata can be found here

SQL to create database tables

Here is the SQL code that is used to create all tables described above in the diagram with all required keys, indexes, and constraints.

CREATE TABLE ingested_artifacts
(
    id           BIGSERIAL PRIMARY KEY,
    package_name TEXT NOT NULL,
    version      TEXT NOT NULL,
    timestamp    TIMESTAMP
);

CREATE TABLE packages
(
    id           BIGSERIAL PRIMARY KEY,
    package_name TEXT NOT NULL,
    forge        TEXT NOT NULL,
    project_name TEXT,
    repository   TEXT,
    created_at   TIMESTAMP
);

CREATE TABLE artifact_repositories
(
    id                  BIGSERIAL PRIMARY KEY,
    repository_base_url TEXT NOT NULL
);

CREATE TABLE package_versions
(
    id                     BIGSERIAL PRIMARY KEY,
    package_id             BIGINT NOT NULL REFERENCES packages (id),
    version                TEXT   NOT NULL,
    cg_generator           TEXT   NOT NULL,
    artifact_repository_id BIGINT references artifact_repositories (id),
    architecture           TEXT,
    created_at             TIMESTAMP,
    metadata               JSONB
);

CREATE TABLE virtual_implementations
(
    virtual_package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    package_version_id         BIGINT NOT NULL REFERENCES package_versions (id)
);

CREATE TABLE dependencies
(
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    dependency_id      BIGINT NOT NULL REFERENCES packages (id),
    version_range      TEXT[] NOT NULL,
    architecture       TEXT[],
    dependency_type    TEXT[],
    alternative_group  BIGINT,
    metadata           JSONB
);

CREATE TABLE module_names
(
    id   BIGSERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TYPE ACCESS AS ENUM ('private', 'public', 'packagePrivate', 'static', 'protected');

CREATE TABLE modules
(
    id                 BIGSERIAL PRIMARY KEY,
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    module_name_id     BIGINT NOT NULL REFERENCES module_names (id),
    final              BOOLEAN,
    access             ACCESS,
    super_classes      BIGINT[],
    super_interfaces   BIGINT[],
    metadata           JSONB,
    annotations        JSONB
);

CREATE TABLE files
(
    id                 BIGSERIAL PRIMARY KEY,
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    path               TEXT   NOT NULL,
    checksum           BYTEA,
    created_at         TIMESTAMP,
    metadata           JSONB
);

CREATE TABLE module_contents
(
    module_id BIGINT NOT NULL REFERENCES modules (id),
    file_id   BIGINT NOT NULL REFERENCES files (id)
);

CREATE TABLE binary_modules
(
    id                 BIGSERIAL PRIMARY KEY,
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
    name               TEXT   NOT NULL,
    created_at         TIMESTAMP,
    metadata           JSONB
);

CREATE TABLE binary_module_contents
(
    binary_module_id BIGINT NOT NULL REFERENCES binary_modules (id),
    file_id          BIGINT NOT NULL REFERENCES files (id)
);

CREATE TYPE CALLABLE_TYPE AS ENUM ('internalBinary', 'externalProduct', 'externalStaticFunction', 'externalUndefined', 'internalStaticFunction');

CREATE TABLE callables
(
    id               BIGSERIAL PRIMARY KEY,
    module_id        BIGINT  NOT NULL REFERENCES modules (id),
    fasten_uri       TEXT    NOT NULL,
    is_internal_call BOOLEAN NOT NULL,
    line_start       INTEGER,
    line_end         INTEGER,
    type             CALLABLE_TYPE,
    defined          BOOLEAN,
    access           ACCESS,
    metadata         JSONB
);

CREATE TYPE CALL_TYPE AS ENUM ('static', 'dynamic', 'virtual', 'interface', 'special');

CREATE TABLE call_sites
(
    source_id         BIGINT NOT NULL REFERENCES callables (id),
    target_id         BIGINT NOT NULL REFERENCES callables (id),
    line              INTEGER,
    call_type         CALL_TYPE,
    receiver_type_ids BIGINT[],
    metadata          JSONB
);

-- CREATE INDEX CONCURRENTLY package_versions_package_id ON package_versions USING btree (package_id);
-- CREATE INDEX CONCURRENTLY dependencies_package_version_id ON dependencies USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY dependencies_dependency_id ON dependencies USING btree (dependency_id);
-- CREATE INDEX CONCURRENTLY files_package_version_id ON files USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY modules_package_version_id ON modules USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY module_contents_module_id ON module_contents USING btree (module_id);
CREATE INDEX CONCURRENTLY module_contents_file_id ON module_contents USING btree (file_id);
-- CREATE INDEX CONCURRENTLY binary_modules_package_version_id ON binary_modules USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY binary_module_contents_binary_module_id ON binary_module_contents USING btree (binary_module_id);
-- CREATE INDEX CONCURRENTLY binary_module_contents_file_id ON binary_module_contents USING btree (file_id);
-- CREATE INDEX CONCURRENTLY callables_module_id ON callables USING btree (module_id);
-- CREATE INDEX CONCURRENTLY edges_source_id ON edges USING btree (source_id);
-- CREATE INDEX CONCURRENTLY edges_target_id ON edges USING btree (target_id);

CREATE UNIQUE INDEX CONCURRENTLY unique_ingested_artifacts ON ingested_artifacts USING btree (package_name, version);
ALTER TABLE ingested_artifacts
    ADD CONSTRAINT unique_ingested_artifacts UNIQUE USING INDEX unique_ingested_artifacts;

CREATE UNIQUE INDEX CONCURRENTLY unique_package_forge ON packages USING btree (package_name, forge);
ALTER TABLE packages
    ADD CONSTRAINT unique_package_forge UNIQUE USING INDEX unique_package_forge;

CREATE UNIQUE INDEX CONCURRENTLY unique_artifact_repositories ON artifact_repositories USING btree (repository_base_url);
ALTER TABLE artifact_repositories
    ADD CONSTRAINT unique_artifact_repositories UNIQUE USING INDEX unique_artifact_repositories;

CREATE UNIQUE INDEX CONCURRENTLY unique_package_version_generator ON package_versions USING btree (package_id, version, cg_generator);
ALTER TABLE package_versions
    ADD CONSTRAINT unique_package_version_generator UNIQUE USING INDEX unique_package_version_generator;

CREATE UNIQUE INDEX CONCURRENTLY unique_virtual_implementation ON virtual_implementations USING btree (virtual_package_version_id, package_version_id);
ALTER TABLE virtual_implementations
    ADD CONSTRAINT unique_virtual_implementation UNIQUE USING INDEX unique_virtual_implementation;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_dependency_range ON dependencies USING btree (package_version_id, dependency_id, version_range);
ALTER TABLE dependencies
    ADD CONSTRAINT unique_version_dependency_range UNIQUE USING INDEX unique_version_dependency_range;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_namespace ON modules USING btree (package_version_id, module_name_id);
ALTER TABLE modules
    ADD CONSTRAINT unique_version_namespace UNIQUE USING INDEX unique_version_namespace;

CREATE UNIQUE INDEX CONCURRENTLY unique_module_file ON module_contents USING btree (module_id, file_id);
ALTER TABLE module_contents
    ADD CONSTRAINT unique_module_file UNIQUE USING INDEX unique_module_file;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_name ON binary_modules USING btree (package_version_id, name);
ALTER TABLE binary_modules
    ADD CONSTRAINT unique_version_name UNIQUE USING INDEX unique_version_name;

CREATE UNIQUE INDEX CONCURRENTLY unique_binary_module_file ON binary_module_contents USING btree (binary_module_id, file_id);
ALTER TABLE binary_module_contents
    ADD CONSTRAINT unique_binary_module_file UNIQUE USING INDEX unique_binary_module_file;

CREATE UNIQUE INDEX CONCURRENTLY unique_version_path ON files USING btree (package_version_id, path);
ALTER TABLE files
    ADD CONSTRAINT unique_version_path UNIQUE USING INDEX unique_version_path;

CREATE UNIQUE INDEX CONCURRENTLY unique_uri_call ON callables USING btree (module_id, fasten_uri, is_internal_call);
ALTER TABLE callables
    ADD CONSTRAINT unique_uri_call UNIQUE USING INDEX unique_uri_call;

CREATE UNIQUE INDEX CONCURRENTLY unique_source_target ON call_sites USING btree (source_id, target_id);
ALTER TABLE call_sites
    ADD CONSTRAINT unique_source_target UNIQUE USING INDEX unique_source_target;

ALTER TABLE callables
    ADD CONSTRAINT check_module_id CHECK ((module_id = -1 AND is_internal_call IS false) OR
                                          (module_id IS NOT NULL AND is_internal_call IS true));

CREATE UNIQUE INDEX CONCURRENTLY unique_module_names ON module_names USING btree (name);
ALTER TABLE module_names
    ADD CONSTRAINT unique_module_names UNIQUE USING INDEX unique_module_names;

INSERT INTO packages (id, package_name, forge)
VALUES (-1, 'external_callables_library', 'mvn')
ON CONFLICT DO NOTHING;

INSERT INTO artifact_repositories (id, repository_base_url)
VALUES (-1, 'https://repo.maven.apache.org/maven2/')
ON CONFLICT DO NOTHING;

INSERT INTO package_versions (id, package_id, version, cg_generator, artifact_repository_id)
VALUES (-1, -1, '0.0.1', 'OPAL', -1)
ON CONFLICT DO NOTHING;

INSERT INTO module_names (id, name)
VALUES (-1, 'global_external_callables')
ON CONFLICT DO NOTHING;

INSERT INTO modules (id, package_version_id, module_name_id)
VALUES (-1, -1, -1)
ON CONFLICT DO NOTHING;
-- Tables and indexes related to vulnerabilities ------------------------------
CREATE TABLE vulnerabilities
(
    id          BIGSERIAL PRIMARY KEY,
    external_id TEXT NOT NULL,
    statement   JSONB NOT NULL
);

CREATE TABLE vulnerabilities_purls
(
    vulnerability_id BIGINT NOT NULL REFERENCES vulnerabilities (id),
    purl             TEXT NOT NULL,
    forge            TEXT NOT NULL,
    package_name     TEXT NOT NULL,
    package_version  TEXT NOT NULL
);

CREATE TABLE vulnerabilities_x_package_versions
(
    vulnerability_id   BIGINT NOT NULL REFERENCES vulnerabilities (id),
    package_version_id BIGINT NOT NULL REFERENCES package_versions (id)
);

CREATE TABLE vulnerabilities_x_callables
(
    vulnerability_id BIGINT NOT NULL REFERENCES vulnerabilities (id),
    callable_id      BIGINT NOT NULL REFERENCES callables (id)
);

CREATE UNIQUE INDEX CONCURRENTLY unique_vulnerabilities ON vulnerabilities USING btree (external_id);
ALTER TABLE vulnerabilities ADD CONSTRAINT unique_vulnerabilities UNIQUE USING INDEX unique_vulnerabilities;

CREATE INDEX CONCURRENTLY vulnerabilities_purls_vulnerability_id ON vulnerabilities_purls USING btree (vulnerability_id);
CREATE INDEX CONCURRENTLY vulnerability_purls_purl ON vulnerabilities_purls USING btree (purl);
CREATE INDEX CONCURRENTLY vulnerabilities_purls_multi ON vulnerabilities_purls USING btree (forge, package_name, package_version);

CREATE INDEX CONCURRENTLY vulnerabilities_x_package_versions_vulnerability_id ON vulnerabilities_x_package_versions USING btree (vulnerability_id);
CREATE INDEX CONCURRENTLY vulnerabilities_x_package_versions_package_version_id ON vulnerabilities_x_package_versions USING btree (package_version_id);
CREATE INDEX CONCURRENTLY vulnerabilities_x_callables_vulnerability_id ON vulnerabilities_x_callables USING btree (vulnerability_id);
CREATE INDEX CONCURRENTLY vulnerabilities_x_callables_callable_id ON vulnerabilities_x_callables USING btree (callable_id);

-- End tables and indexes related to vulnerabilities ------------------------------

Query optimization indices

The following indices are used to optimise certain frequently-used queries, and should be installed on a per need basis

Finding callables by fasten_uri

CREATE INDEX CONCURRENTLY "idx_callables_fasten_uri"
ON "callables"
USING btree (digest("fasten_uri",  'sha1'::text));

This index can be used by using a query like the following:

select c.id, c.fasten_uri
from callables c join modules m on c.module_id = m.id 
    join package_versions pv on pv.id = m.package_version_id
    join packages p on p.id = pv.package_id
where 
    pv.version = '0.9.2' and p.package_name = 'com.eclipsesource.minimal-json:minimal-json'
    and digest(c.fasten_uri, 'sha1'::text) in (
        digest('/com.eclipsesource.json/ParseException.%3Cinit%3E(%2Fjava.lang%2FString,%2Fjava.lang%2FIntegerType,%2Fjava.lang%2FIntegerType,%2Fjava.lang%2FIntegerType)%2Fjava.lang%2FVoidType', 'sha1'::text),
        digest('/com.eclipsesource.json/JsonWriter.writeObjectValueSeparator()%2Fjava.lang%2FVoidType', 'sha1'::text)
);

SQL to initialise users and permissions

The following need to be executed once on the master using the Postgres superuser (usually, postgres) to initialise users and databases.

-- Create default read/write user and databases

create user fasten with encrypted password PASSWD; -- change me
ALTER USER fasten with CREATEDB;

create database fasten_java ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_rust ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_c ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_python ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;

grant all privileges on database fasten_java to fasten;
grant all privileges on database fasten_rust to fasten;
grant all privileges on database fasten_c to fasten;
grant all privileges on database fasten_python to fasten;
-- Create a readonly user for querying

CREATE USER fastenro WITH ENCRYPTED PASSWORD 'fasten';

GRANT CONNECT ON DATABASE fasten_java TO fastenro;
GRANT CONNECT ON DATABASE fasten_rust TO fastenro;
GRANT CONNECT ON DATABASE fasten_c TO fastenro;
GRANT CONNECT ON DATABASE fasten_python TO fastenro;

ALTER DATABASE fasten_java OWNER to fasten;
ALTER DATABASE fasten_rust OWNER to fasten;
ALTER DATABASE fasten_c OWNER to fasten;
ALTER DATABASE fasten_python OWNER to fasten;

\c fasten_java
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;

\c fasten_rust
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;

\c fasten_c
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;

\c fasten_python
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;