The reference guides here are for quickly understanding data formats and other standards used by Senate Matching.

Technical Requirements


See 02.00 Contributor Node System Requirements.

File Formats


The CSV file format supported by Senate Matching conforms to RFC 4180.

  • Fields are comma separated.
  • Rows are newline (CRLF) separated.
  • Double quotes are used to escape fields that might contain commas (e.g. "Smith, esq.").
  • If a double quote is part of the field, it must be escaped by preceding it with another double quote (e.g. "Dan ""The Man"" Levitan").

In addition, some extensions are supported:

  • String encoding may be ASCII or UTF-8.
  • Unix-style line endings are accepted as row separators.
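
As a rough illustration of the rules above, the following Python sketch reads and writes RFC 4180-style data with the standard-library csv module. The sample rows and the helper names read_rows and write_rows are made up for this example; it is not the parser that Senate Matching itself uses.

```python
import csv
import io

# Made-up sample data; the quoted fields mirror the examples in the list above.
SAMPLE = (
    b'name,title\r\n'
    b'"Smith, esq.",lawyer\r\n'
    b'"Dan ""The Man"" Levitan",analyst\r\n'
)


def read_rows(data: bytes) -> list:
    """Decode UTF-8 (a superset of ASCII) and parse the text into rows."""
    text = data.decode("utf-8")
    # newline="" leaves line endings untouched so the csv module can
    # recognise both CRLF and LF row separators itself.
    return list(csv.reader(io.StringIO(text, newline="")))


def write_rows(rows) -> str:
    """Serialise rows with commas, CRLF row endings and doubled quotes."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)  # default dialect ends rows with CRLF
    return buffer.getvalue()


rows = read_rows(SAMPLE)
# The same data with Unix-style line endings parses to the same rows.
unix_rows = read_rows(SAMPLE.replace(b"\r\n", b"\n"))
```

Here rows[1][0] is the unquoted value Smith, esq. and rows[2][0] is Dan "The Man" Levitan, with the doubled quotes collapsed back to single ones.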

Glossary


Term / Meaning
Aggregator Node

The technical component that handles the aggregation process for matching. The Aggregator Node combines information provided by all the Matcher Nodes and provides aggregated information (i.e. matches) to the requestor.

Closed Catalog

The Data Republic Closed Catalog displays data that has been made available for exchange or analysis by your organization on Data Republic’s Senate Data Exchange Platform. The Closed Catalog feature is only available to organizations who have licensed the Senate Platform. Data Listings within the Closed Catalog are only visible to users within your organization.

Contributor Node

The Contributor Node is the technical component that generates Tokens, hashes PII, and distributes the slices to Matcher Nodes.

Conversations

Senate has an in-built communication tool called Conversations that allows you to communicate with your project collaborators. All communications regarding a project should be done via this tool.

Database

The schema of data that an organization intends to share, including the data itself, stored on the Senate Platform. A Database may contain one or more tables containing one or more columns and zero or more rows.

Data Custodian (DC) / Contributor

A Participant who provides data for exchange on the Senate Platform.

Note that an organization that has uploaded data is not necessarily a Data Custodian: the data must have been uploaded for the purposes of data exchange. Data uploaded for your own use is not for exchange.

A DC may also be any one of the other three Participant roles (see 'Participant').

Data Listing

Data Listings in the Senate Catalog are created and managed by Data Custodians. A Data Listing contains a data package, which may include files, tables or views associated with a particular topic or theme. Data Listings published to the Data Republic Catalog are visible to all DR participants, whereas Data Listings published to an organization’s Closed Catalog are only visible to users within that organization.

Data Package

An acquirable set of data from a Data Custodian’s listing.

A package may include:

  • One or more Data Views, which may be a row or column subset, an aggregation, some filters, limits and orderings or any combination of those.
  • A snapshot to control whether the data is from a specific point in time or is the latest data, including updates from a Data Custodian.
  • A file containing an algorithm that can transform the output from the View.

Data Republic Catalog

A space for Data Custodians to share their Data Listings with approved Participants on the DR Platform. Organizations who have signed the DRPA will have access to Data Listings within the Data Republic Catalog. Data Listings contain data packages that may include files, databases, tables or views.

A database created by a Data Custodian must have an associated Term Sheet that governs its use on the Platform.

Data Product

The output of the statistical analysis that you can take off the Platform. The output must align with what is agreed to in the data license for a project.

A Data Product can be:

  • Public: can be resold to multiple clients
  • Private: only for the use of the Participant who built it

DR

Data Republic

Discovery Workspace (DW)

A project-specific workspace, consisting of virtual machines accessed on the platform, that allows an analyst to analyze the data they were approved to access.

Once the data license for a project is approved, participants can request that data packages be loaded to the workspace (i.e. virtual machines) for analysis. Any output must align with the agreed permitted use outlined in the data license.

The virtual machines can run on MS Windows or Linux to host data analysis tools, such as RStudio, and copies of the datasets that you have licensed from a Data Custodian.

DRPA

See Participation Agreement (DRPA).

Hash

A SHA-512 hash of any PII field. The SHA-512 hash may be truncated to have fewer than 512 bits.

Hash Split

A subset of the Hash. For example, a single Hash Split may include bits 0 through 9 while another Hash Split may include bits 10 through 19.

Hash Splits are distributed amongst Matcher Nodes.
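
A minimal sketch of how a Hash and its Hash Splits could be derived, using Python's hashlib. Several details are assumptions made for illustration, because this guide does not specify them: the input is trimmed and lower-cased before hashing, truncation keeps the leading bits, bits are numbered from the most significant bit of the digest, and every split is 10 bits wide. The names hash_pii and hash_splits are hypothetical.

```python
import hashlib


def hash_pii(value: str, bits: int = 512) -> str:
    """Return the first `bits` bits of the SHA-512 digest as a '0'/'1' string."""
    digest = hashlib.sha512(value.encode("utf-8")).digest()
    bit_string = "".join(f"{byte:08b}" for byte in digest)
    return bit_string[:bits]  # assumption: truncation keeps the leading bits


def hash_splits(bit_string: str, width: int) -> list:
    """Cut a bit string into consecutive, non-overlapping splits of `width` bits."""
    return [bit_string[i:i + width] for i in range(0, len(bit_string), width)]


# Hypothetical PII value; the normalisation step is an assumption.
pii = " Jane@Example.com ".strip().lower()
truncated_hash = hash_pii(pii, bits=40)
splits = hash_splits(truncated_hash, width=10)
# splits[0] holds bits 0 through 9, splits[1] holds bits 10 through 19, and so on.
```

With these settings the 40-bit Hash yields four Hash Splits, each of which could be sent to a different Matcher Node.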

License

A Data License provides guidelines for the permitted use of data packages in a project and outlines related commercials. Once a License is submitted it is pending approval. It must be approved by all project Participants, including the Participant who submitted the request. Once approved, Data Republic will provide the final approval prior to loading any data packages into a Discovery Workspace (DW) or making the data available.

Matcher Node

A Matcher Node stores hashed slices of PII during the tokenization process. This means that no single Matcher Node can contain an entire hashed field value for PII. When a request for matching is made, the Matcher Node compares Hash Splits and returns Token pairs to an Aggregator Node. The Aggregator Node will only retain matched Tokens common to all Matcher Nodes.
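
The comparison and aggregation described above can be sketched as follows. This is a simplified model, not the actual protocol: each Matcher Node is represented by two dictionaries that map a Token to the Hash Split that node holds, and the function names match_on_split and aggregate, along with the tokens and bit strings, are invented for the example.

```python
from typing import Dict, List, Set, Tuple

TokenPair = Tuple[str, str]


def match_on_split(left: Dict[str, str], right: Dict[str, str]) -> Set[TokenPair]:
    """One Matcher Node: pair up Tokens whose Hash Splits are equal."""
    tokens_by_split: Dict[str, List[str]] = {}
    for token, split in right.items():
        tokens_by_split.setdefault(split, []).append(token)
    return {
        (left_token, right_token)
        for left_token, split in left.items()
        for right_token in tokens_by_split.get(split, [])
    }


def aggregate(per_matcher: List[Set[TokenPair]]) -> Set[TokenPair]:
    """Aggregator Node: keep only the pairs reported by every Matcher Node."""
    if not per_matcher:
        return set()
    return set.intersection(*per_matcher)


# Two hypothetical Matcher Nodes, each holding a different Hash Split per Token.
matcher_0 = match_on_split({"A1": "0101", "A2": "1110"}, {"B1": "0101", "B2": "1110"})
matcher_1 = match_on_split({"A1": "0011", "A2": "1000"}, {"B1": "0011", "B2": "0111"})
matches = aggregate([matcher_0, matcher_1])
```

In this example A2 and B2 agree on the first split but not the second, so only the pair (A1, B1) survives aggregation.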

Network

Senate Matching operates in discrete Networks.

The Network Operator is the organization (or group) that runs the Senate Matching Network.

Node

An instance of one of the services in the Senate Matching Network (e.g. Contributor, Matcher). A Node belongs to the Organization that operates it and to the Network in which it operates. A Node may not belong to more than one Network.

Organization

A company or other legal entity that is on the platform.

An organization that has signed the DRPA is a Participant.

Participant

A DR Participant is any company or individual working on or with the DR Platform who has signed the DRPA.

Participation Agreement ("DRPA")

All Participants must sign the Data Republic Participation Agreement (DRPA) before being allowed access to Senate or a DW.

The DRPA has two parts:

  • A common set of terms that all participants agree to
  • A set of modules that govern roles

Permitted Use

The Permitted Use of a data package refers to how it will be used in a Senate project. Permitted Use of a data package for a project must be requested and approved by the Data Custodian and Data Republic.

Personally Identifiable Information (PII)

Information that can reasonably be used to identify a single person. This always includes data such as an email address, street address, driver's license number, phone number, or social security number. It may also include information that does not directly reference a person but could be used to re-identify someone when combined with other details, for example IP addresses, location history, or employment history.

Project

A governed space for Data Republic’s users to work together on the creation of Data Products.

A project contains features such as conversations to discuss data requirements, forms to request access to data for Permitted Use, and secure Discovery Workspaces to access data for analysis.

Requestor

An organization that wishes to obtain a mapping between two Contributor Databases containing Tokens. The Requestor may only observe Tokens directly if they are one of the Data Custodians. Otherwise, they may only receive non-PII data.

Senate

The name of Data Republic’s Data Exchange Platform.

Senate Matching

A system for matching datasets based on Personally Identifiable Information (PII) without ever directly accessing or exposing the plain-text PII.

Slice

A Slice is a column-wise subset of a database. A single Slice will consist of multiple rows, where each row represents a single Person. A single Slice will contain the entire Token and a single Hash Split.
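
To make the column-wise layout concrete, here is a small sketch that turns per-person rows into Slices. It assumes each row is a (Token, list of Hash Splits) pair and that every row has the same number of splits; the function name make_slices and the sample values are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, List[str]]          # (Token, [Hash Split 0, Hash Split 1, ...])
SliceRow = Tuple[str, str]           # (Token, one Hash Split)


def make_slices(rows: List[Row]) -> List[List[SliceRow]]:
    """Return one Slice per Hash Split position; each Slice keeps the full Token."""
    if not rows:
        return []
    split_count = len(rows[0][1])    # assumption: all rows have the same split count
    return [
        [(token, splits[index]) for token, splits in rows]
        for index in range(split_count)
    ]


# Hypothetical data: two people, three Hash Splits each.
people = [
    ("T1", ["00", "01", "10"]),
    ("T2", ["11", "10", "01"]),
]
slices = make_slices(people)
```

Each element of slices has one row per person, holding the entire Token and exactly one Hash Split, which matches the definition above.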

Table

A table is a component of a database. A database must have at least one table because tables are where data is stored. Each table is made up of rows and columns.

A table is a grid with the columns going from left to right and each data entry (or record) is listed down the grid as a row.

Term Sheet

A Term Sheet is an agreement between Data Republic and the Data Custodian for a database created on the Platform. Data Custodians can use Term Sheets to set out the terms that apply to their database, how it may be used on the Platform, and any commercials between the Data Custodian and Data Republic for the use of that database. Each database created by a Data Custodian must have its own Term Sheet.

Token

A surrogate key that maps to zero or one persons. The Token is non-PII and is randomly generated. Tokens uniquely identify a Person within the database that issued them and can be accessed via the Custodian’s Contributor Node.
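
One way to picture random, database-unique Tokens is the sketch below, which uses Python's secrets module. The Token length, the hexadecimal format, and the function name issue_tokens are assumptions for illustration only; this guide does not describe how the Contributor Node actually generates Tokens.

```python
import secrets
from typing import Dict, Iterable


def issue_tokens(record_ids: Iterable[str], nbytes: int = 16) -> Dict[str, str]:
    """Assign each record a random Token that is unique within this database."""
    tokens: Dict[str, str] = {}
    used = set()
    for record_id in record_ids:
        token = secrets.token_hex(nbytes)     # random, carries no PII
        while token in used:                  # regenerate on the rare collision
            token = secrets.token_hex(nbytes)
        used.add(token)
        tokens[record_id] = token
    return tokens


# Hypothetical internal record identifiers.
issued = issue_tokens(["row-1", "row-2", "row-3"])
```

Because the Tokens are random rather than derived from the data, nothing about a person can be recovered from a Token alone.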

Token Database

A database accessed via a Contributor Node that issues Tokens.

User

A person with a Data Republic Senate account.

View

A view is a query on a table that allows you to filter and sort the rows to present a subset of the table's data.
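
As a generic illustration of a table and a view over it, the sketch below uses Python's built-in sqlite3 module. It is not Senate's own view mechanism; the table, column, and view names and the sample rows are invented. The view selects a subset of columns and rows, and the sort is applied when the view is read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table: columns going left to right, one record per row.
conn.execute("CREATE TABLE customers (token TEXT, state TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("T1", "NSW", 120.0), ("T2", "VIC", 80.0), ("T3", "NSW", 45.5)],
)

# A view: a stored query that exposes only some columns and rows of the table.
conn.execute(
    "CREATE VIEW nsw_customers AS "
    "SELECT token, spend FROM customers WHERE state = 'NSW'"
)

# Reading the view, sorted by spend from highest to lowest.
view_rows = conn.execute(
    "SELECT token, spend FROM nsw_customers ORDER BY spend DESC"
).fetchall()
conn.close()
```

The underlying table still holds all three rows; the view presents only the two NSW records and omits the state column.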