-
Notifications
You must be signed in to change notification settings - Fork 28
Metadata Database Schema
This page will explain the schema that is used to store the metadata of the callgraphs.
We decided to use PostgreSQL version 12 or higher because it has some very useful features that we utilize (like extensive JSON data type support).
Depicted below is the E/R diagram of the metadata database schema we have devised:
- The
packages
table contains basic information about the packages/products like the name of the package, forge, repository URL and etc. - Each package can have multiple versions, thus the table
package_versions
referencespackages
and stores information like version of the package. - Each package version can have multiple dependencies to other packages. Therefore table
dependencies
referencespackage_versions.id
as source package andpackages.id
as a dependent package and has aversion_range
which specifies acceptable versions of dependency. - Each
package_version
can have multiple modules in them. For some languages, these modules can be viewed as classes. Thus the tablemodules
referencespackage_versions
and stores such attributes as the ID of the name of the module that has been previously added to themodule_names
table,superClasses
andsuperInterfaces
, whether or not module has been declared final, access type of the module, annotations, and metadata about the module in form of JSON. - Each module can consist of multiple methods - callables. Therefore, table
callables
stores reference tomodules
, FASTEN URI of the method, boolean value if this callable is internal or external, line numbers of the start and the end of the method, timestamp and metadata. However, for external callable, a reference tomodules
is -1 which is a default module for all external callables. Thus, when saving the graph, all external callables are inserted in thecallables
table with -1 as a reference tomodules
andis_internal_call
set to false. - The table
call-sites
stores the graph's edges. Each call site has a source and target which both referencecallables
table allowing to store both internal and external calls. Each call site also stores call type, call line number and ID of receiver type that has been previously saved to themodule_names
table, and metadata. - Each module can also contain multiple
files
and one file can contain multiple modules. Therefore this many-to-many relationship is realized by means ofmodule_contents
table. - Each package version also has binary modules for some of the languages.
binary_modules
are very similar tomodule
, but they are supposed to store data for binary modules and link to binary files. - Similarly to modules,
binary_modules
has many-to-many relationship realized by thebinary_module_contents
table. - In C programming language programs can have virtual implementations and in order to facilitate this relationship, there is a table
virtual_implemetations
that describes a many-to-many relationship between the package_verions table and itself. -
ingested_artifacts
table contains artifacts that have been ingested using REST API's lazy ingestion -
artifact_repositories
table holds the base URLs to the artifact repositories used to download the artifacts such as Maven Central.
From POM Analyzer
POM Analyzer extracts a list of dependencies of an artifact, dependencyManagement of that artifact, repository URL, and commit tag, packaging type, project name if they are present, and generates sources URL.
Each dependency is put into the dependencies
table with detailed metadata. Here is an example of dependency's metadata:
{
"type":"",
"scope":"compile",
"groupId":"com.yahoo.athenz",
"optional":false,
"artifactId":"athenz-zms-java-client",
"classifier":"",
"exclusions":[
{
"groupId":"org.slf4j",
"artifactId":"slf4j-api"
},
{
"groupId":"org.glassfish.jersey.core",
"artifactId":"jersey-client"
},
{
"groupId":"org.glassfish.jersey.media",
"artifactId":"jersey-media-json-jackson"
},
{
"groupId":"com.fasterxml.jackson.core",
"artifactId":"jackson-core"
},
{
"groupId":"com.fasterxml.jackson.core",
"artifactId":"jackson-databind"
},
{
"groupId":"com.fasterxml.jackson.core",
"artifactId":"jackson-annotations"
}
],
"versionConstraints":[
{
"lowerBound":"1.7.28",
"upperBound":"1.7.28",
"isLowerHardRequirement":false,
"isUpperHardRequirement":false
}
]
}
The dependencyManagement
information is put into the package_versions.metadata
along with commit tag (a tag in the repository which should signify certain version), link to Maven sources JAR file, and packaging type of the artifact. Here is an example of a package version's metadata:
{
"commitTag":"2.5.0",
"sourcesUrl":"https://repo.maven.apache.org/maven2/junit/junit/4.12/junit-4.12-sources.jar",
"packagingType":"jar",
"parentCoordinate": "",
"dependencyManagement":{
"dependencies":[
{
"type":"",
"scope":"",
"groupId":"com.yahoo.elide",
"optional":false,
"artifactId":"elide-core",
"classifier":"",
"exclusions":[
],
"versionConstraints":[
{
"lowerBound":"2.5.0",
"upperBound":"2.5.0",
"isLowerHardRequirement":false,
"isUpperHardRequirement":false
}
]
}
]
}
}
From Metadata Plugin
Metadata plugin inserts metadata produced by the call graph generator and inserts this metadata into the callables
, and modules
tables.
Here is an example of module's metadata:
{
"access": "public",
"final": false,
"superInterfaces": [],
"superClasses": ["/java.lang/Object"]
}
Here is an example of callable's metadata:
{
"access": "public",
"defined": true
}
More detailed information about this metadata can be found here
Here is the SQL code that is used to create all tables described above in the diagram with all required keys, indexes, and constraints.
CREATE TABLE ingested_artifacts
(
id BIGSERIAL PRIMARY KEY,
package_name TEXT NOT NULL,
version TEXT NOT NULL,
timestamp TIMESTAMP
);
CREATE TABLE packages
(
id BIGSERIAL PRIMARY KEY,
package_name TEXT NOT NULL,
forge TEXT NOT NULL,
project_name TEXT,
repository TEXT,
created_at TIMESTAMP
);
CREATE TABLE artifact_repositories
(
id BIGSERIAL PRIMARY KEY,
repository_base_url TEXT NOT NULL
);
CREATE TABLE package_versions
(
id BIGSERIAL PRIMARY KEY,
package_id BIGINT NOT NULL REFERENCES packages (id),
version TEXT NOT NULL,
cg_generator TEXT NOT NULL,
artifact_repository_id BIGINT references artifact_repositories (id),
architecture TEXT,
created_at TIMESTAMP,
metadata JSONB
);
CREATE TABLE virtual_implementations
(
virtual_package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
package_version_id BIGINT NOT NULL REFERENCES package_versions (id)
);
CREATE TABLE dependencies
(
package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
dependency_id BIGINT NOT NULL REFERENCES packages (id),
version_range TEXT[] NOT NULL,
architecture TEXT[],
dependency_type TEXT[],
alternative_group BIGINT,
metadata JSONB
);
CREATE TABLE module_names
(
id BIGSERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TYPE ACCESS AS ENUM ('private', 'public', 'packagePrivate', 'static', 'protected');
CREATE TABLE modules
(
id BIGSERIAL PRIMARY KEY,
package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
module_name_id BIGINT NOT NULL REFERENCES module_names (id),
final BOOLEAN,
access ACCESS,
super_classes BIGINT[],
super_interfaces BIGINT[],
metadata JSONB,
annotations JSONB
);
CREATE TABLE files
(
id BIGSERIAL PRIMARY KEY,
package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
path TEXT NOT NULL,
checksum BYTEA,
created_at TIMESTAMP,
metadata JSONB
);
CREATE TABLE module_contents
(
module_id BIGINT NOT NULL REFERENCES modules (id),
file_id BIGINT NOT NULL REFERENCES files (id)
);
CREATE TABLE binary_modules
(
id BIGSERIAL PRIMARY KEY,
package_version_id BIGINT NOT NULL REFERENCES package_versions (id),
name TEXT NOT NULL,
created_at TIMESTAMP,
metadata JSONB
);
CREATE TABLE binary_module_contents
(
binary_module_id BIGINT NOT NULL REFERENCES binary_modules (id),
file_id BIGINT NOT NULL REFERENCES files (id)
);
CREATE TYPE CALLABLE_TYPE AS ENUM ('internalBinary', 'externalProduct', 'externalStaticFunction', 'externalUndefined', 'internalStaticFunction');
CREATE TABLE callables
(
id BIGSERIAL PRIMARY KEY,
module_id BIGINT NOT NULL REFERENCES modules (id),
fasten_uri TEXT NOT NULL,
is_internal_call BOOLEAN NOT NULL,
line_start INTEGER,
line_end INTEGER,
type CALLABLE_TYPE,
defined BOOLEAN,
access ACCESS,
metadata JSONB
);
CREATE TYPE CALL_TYPE AS ENUM ('static', 'dynamic', 'virtual', 'interface', 'special');
CREATE TABLE call_sites
(
source_id BIGINT NOT NULL REFERENCES callables (id),
target_id BIGINT NOT NULL REFERENCES callables (id),
line INTEGER,
call_type CALL_TYPE,
receiver_type_ids BIGINT[],
metadata JSONB
);
-- CREATE INDEX CONCURRENTLY package_versions_package_id ON package_versions USING btree (package_id);
-- CREATE INDEX CONCURRENTLY dependencies_package_version_id ON dependencies USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY dependencies_dependency_id ON dependencies USING btree (dependency_id);
-- CREATE INDEX CONCURRENTLY files_package_version_id ON files USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY modules_package_version_id ON modules USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY module_contents_module_id ON module_contents USING btree (module_id);
CREATE INDEX CONCURRENTLY module_contents_file_id ON module_contents USING btree (file_id);
-- CREATE INDEX CONCURRENTLY binary_modules_package_version_id ON binary_modules USING btree (package_version_id);
-- CREATE INDEX CONCURRENTLY binary_module_contents_binary_module_id ON binary_module_contents USING btree (binary_module_id);
-- CREATE INDEX CONCURRENTLY binary_module_contents_file_id ON binary_module_contents USING btree (file_id);
-- CREATE INDEX CONCURRENTLY callables_module_id ON callables USING btree (module_id);
-- CREATE INDEX CONCURRENTLY edges_source_id ON edges USING btree (source_id);
-- CREATE INDEX CONCURRENTLY edges_target_id ON edges USING btree (target_id);
CREATE UNIQUE INDEX CONCURRENTLY unique_ingested_artifacts ON ingested_artifacts USING btree (package_name, version);
ALTER TABLE ingested_artifacts
ADD CONSTRAINT unique_ingested_artifacts UNIQUE USING INDEX unique_ingested_artifacts;
CREATE UNIQUE INDEX CONCURRENTLY unique_package_forge ON packages USING btree (package_name, forge);
ALTER TABLE packages
ADD CONSTRAINT unique_package_forge UNIQUE USING INDEX unique_package_forge;
CREATE UNIQUE INDEX CONCURRENTLY unique_artifact_repositories ON artifact_repositories USING btree (repository_base_url);
ALTER TABLE artifact_repositories
ADD CONSTRAINT unique_artifact_repositories UNIQUE USING INDEX unique_artifact_repositories;
CREATE UNIQUE INDEX CONCURRENTLY unique_package_version_generator ON package_versions USING btree (package_id, version, cg_generator);
ALTER TABLE package_versions
ADD CONSTRAINT unique_package_version_generator UNIQUE USING INDEX unique_package_version_generator;
CREATE UNIQUE INDEX CONCURRENTLY unique_virtual_implementation ON virtual_implementations USING btree (virtual_package_version_id, package_version_id);
ALTER TABLE virtual_implementations
ADD CONSTRAINT unique_virtual_implementation UNIQUE USING INDEX unique_virtual_implementation;
CREATE UNIQUE INDEX CONCURRENTLY unique_version_dependency_range ON dependencies USING btree (package_version_id, dependency_id, version_range);
ALTER TABLE dependencies
ADD CONSTRAINT unique_version_dependency_range UNIQUE USING INDEX unique_version_dependency_range;
CREATE UNIQUE INDEX CONCURRENTLY unique_version_namespace ON modules USING btree (package_version_id, module_name_id);
ALTER TABLE modules
ADD CONSTRAINT unique_version_namespace UNIQUE USING INDEX unique_version_namespace;
CREATE UNIQUE INDEX CONCURRENTLY unique_module_file ON module_contents USING btree (module_id, file_id);
ALTER TABLE module_contents
ADD CONSTRAINT unique_module_file UNIQUE USING INDEX unique_module_file;
CREATE UNIQUE INDEX CONCURRENTLY unique_version_name ON binary_modules USING btree (package_version_id, name);
ALTER TABLE binary_modules
ADD CONSTRAINT unique_version_name UNIQUE USING INDEX unique_version_name;
CREATE UNIQUE INDEX CONCURRENTLY unique_binary_module_file ON binary_module_contents USING btree (binary_module_id, file_id);
ALTER TABLE binary_module_contents
ADD CONSTRAINT unique_binary_module_file UNIQUE USING INDEX unique_binary_module_file;
CREATE UNIQUE INDEX CONCURRENTLY unique_version_path ON files USING btree (package_version_id, path);
ALTER TABLE files
ADD CONSTRAINT unique_version_path UNIQUE USING INDEX unique_version_path;
CREATE UNIQUE INDEX CONCURRENTLY unique_uri_call ON callables USING btree (module_id, fasten_uri, is_internal_call);
ALTER TABLE callables
ADD CONSTRAINT unique_uri_call UNIQUE USING INDEX unique_uri_call;
CREATE UNIQUE INDEX CONCURRENTLY unique_source_target ON call_sites USING btree (source_id, target_id);
ALTER TABLE call_sites
ADD CONSTRAINT unique_source_target UNIQUE USING INDEX unique_source_target;
ALTER TABLE callables
ADD CONSTRAINT check_module_id CHECK ((module_id = -1 AND is_internal_call IS false) OR
(module_id IS NOT NULL AND is_internal_call IS true));
CREATE UNIQUE INDEX CONCURRENTLY unique_module_names ON module_names USING btree (name);
ALTER TABLE module_names
ADD CONSTRAINT unique_module_names UNIQUE USING INDEX unique_module_names;
INSERT INTO packages (id, package_name, forge)
VALUES (-1, 'external_callables_library', 'mvn')
ON CONFLICT DO NOTHING;
INSERT INTO artifact_repositories (id, repository_base_url)
VALUES (-1, 'https://repo.maven.apache.org/maven2/')
ON CONFLICT DO NOTHING;
INSERT INTO package_versions (id, package_id, version, cg_generator, artifact_repository_id)
VALUES (-1, -1, '0.0.1', 'OPAL', -1)
ON CONFLICT DO NOTHING;
INSERT INTO module_names (id, name)
VALUES (-1, 'global_external_callables')
ON CONFLICT DO NOTHING;
INSERT INTO modules (id, package_version_id, module_name_id)
VALUES (-1, -1, -1)
ON CONFLICT DO NOTHING;
-- Tables and indexes related to vulnerabilities ------------------------------
CREATE TABLE vulnerabilities
(
id BIGSERIAL PRIMARY KEY,
external_id TEXT NOT NULL,
statement JSONB NOT NULL
);
CREATE TABLE vulnerabilities_purls
(
vulnerability_id BIGINT NOT NULL REFERENCES vulnerabilities (id),
purl TEXT NOT NULL,
forge TEXT NOT NULL,
package_name TEXT NOT NULL,
package_version TEXT NOT NULL
);
CREATE TABLE vulnerabilities_x_package_versions
(
vulnerability_id BIGINT NOT NULL REFERENCES vulnerabilities (id),
package_version_id BIGINT NOT NULL REFERENCES package_versions (id)
);
CREATE TABLE vulnerabilities_x_callables
(
vulnerability_id BIGINT NOT NULL REFERENCES vulnerabilities (id),
callable_id BIGINT NOT NULL REFERENCES callables (id)
);
CREATE UNIQUE INDEX CONCURRENTLY unique_vulnerabilities ON vulnerabilities USING btree (external_id);
ALTER TABLE vulnerabilities ADD CONSTRAINT unique_vulnerabilities UNIQUE USING INDEX unique_vulnerabilities;
CREATE INDEX CONCURRENTLY vulnerabilities_purls_vulnerability_id ON vulnerabilities_purls USING btree (vulnerability_id);
CREATE INDEX CONCURRENTLY vulnerability_purls_purl ON vulnerabilities_purls USING btree (purl);
CREATE INDEX CONCURRENTLY vulnerabilities_purls_multi ON vulnerabilities_purls USING btree (forge, package_name, package_version);
CREATE INDEX CONCURRENTLY vulnerabilities_x_package_versions_vulnerability_id ON vulnerabilities_x_package_versions USING btree (vulnerability_id);
CREATE INDEX CONCURRENTLY vulnerabilities_x_package_versions_package_version_id ON vulnerabilities_x_package_versions USING btree (package_version_id);
CREATE INDEX CONCURRENTLY vulnerabilities_x_callables_vulnerability_id ON vulnerabilities_x_callables USING btree (vulnerability_id);
CREATE INDEX CONCURRENTLY vulnerabilities_x_callables_callable_id ON vulnerabilities_x_callables USING btree (callable_id);
-- End tables and indexes related to vulnerabilities ------------------------------
The following indices are used to optimise certain frequently-used queries, and should be installed on a per need basis
CREATE INDEX CONCURRENTLY "idx_callables_fasten_uri"
ON "callables"
USING btree (digest("fasten_uri", 'sha1'::text));
This index can be used by using a query like the following:
select c.id, c.fasten_uri
from callables c join modules m on c.module_id = m.id
join package_versions pv on pv.id = m.package_version_id
join packages p on p.id = pv.package_id
where
pv.version = '0.9.2' and p.package_name = 'com.eclipsesource.minimal-json:minimal-json'
and digest(c.fasten_uri, 'sha1'::text) in (
digest('/com.eclipsesource.json/ParseException.%3Cinit%3E(%2Fjava.lang%2FString,%2Fjava.lang%2FIntegerType,%2Fjava.lang%2FIntegerType,%2Fjava.lang%2FIntegerType)%2Fjava.lang%2FVoidType', 'sha1'::text),
digest('/com.eclipsesource.json/JsonWriter.writeObjectValueSeparator()%2Fjava.lang%2FVoidType', 'sha1'::text)
);
The following need to be executed once on the master using the Postgres superuser (usually, postgres
) to initialise users and databases.
-- Create default read/write user and databases
create user fasten with encrypted password PASSWD; -- change me
ALTER USER fasten with CREATEDB;
create database fasten_java ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_rust ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_c ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
create database fasten_python ENCODING 'UTF8' LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' TEMPLATE template0;
grant all privileges on database fasten_java to fasten;
grant all privileges on database fasten_rust to fasten;
grant all privileges on database fasten_c to fasten;
grant all privileges on database fasten_python to fasten;
-- Create a readonly user for querying
CREATE USER fastenro WITH ENCRYPTED PASSWORD 'fasten';
GRANT CONNECT ON DATABASE fasten_java TO fastenro;
GRANT CONNECT ON DATABASE fasten_rust TO fastenro;
GRANT CONNECT ON DATABASE fasten_c TO fastenro;
GRANT CONNECT ON DATABASE fasten_python TO fastenro;
ALTER DATABASE fasten_java OWNER to fasten;
ALTER DATABASE fasten_rust OWNER to fasten;
ALTER DATABASE fasten_c OWNER to fasten;
ALTER DATABASE fasten_python OWNER to fasten;
\c fasten_java
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;
\c fasten_rust
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;
\c fasten_c
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;
\c fasten_python
GRANT USAGE ON SCHEMA public TO fastenro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO fastenro;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO fastenro;
CREATE EXTENSION pgcrypto;