Load PI

Stop! Before you load PII, you need to have signed the Senate Matching license agreement. You may test with synthetic data in the meantime.


01

Splitting PII from your customer data

Datasets loaded on to the Senate Platform must not contain PII. If your existing dataset includes PII, it will need to be split into two separate datasets: one containing the non-identifiable transactional or demographic information for your customers (often referred to as “attribute” or “transaction” data), and another containing the PII for those same customers (usually referred to as “PII” or “identity” data).

Notes

  1. Senate Matching now supports matching on multiple fields (in the past, 'email' was the only option)
  2. Senate Matching now supports multi-valued fields (for example, a person can have more than one email address)

In the example below, “personid” refers to an identifier from your internal CRM (customer relationship management) system. You will probably want to store the tokens generated for each customer in your CRM. Data Republic recommends that Senate Matching tokens are the only customer identifiers included in the attribute and transaction data that you load into Senate.

Customer data split into PII and attribute tables

PII (identity) data:

| personid | email | phone | dpid | nationalid | family_name | given_name | frequent_flyer_number |
| 11 | alison@email.com | (555) 623-2565 | 3607 | kclnMPjq | Sutton | Alison | Vb8270926047 |
| 23 | james@email.com | (555) 710-1092 | 17995 | zoWugYIc | James | Logan | OO7205704461 |
| 43 | john@email.com | (555) 877-9905 | 36150 | gbyndpnM | John | Logan | yT7409862740 |

Attribute data:

| personid | attribute1 | attribute2 | attribute3 |
| 11 | | | |
| 23 | | | |
| 43 | | | |

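How you perform the split is up to you; many custodians use SQL or a scripting language. As a minimal sketch (assuming your source data is a single CSV named customers.csv containing the PII columns shown above plus your attribute columns; the file name and attribute column names are illustrative only), the split could look like this in Python:

```python
import csv

# Columns that belong in the PII (identity) file; everything else is treated
# as attribute data. File and column names here are assumptions for illustration.
PII_COLUMNS = ["email", "phone", "dpid", "nationalid",
               "family_name", "given_name", "frequent_flyer_number"]

with open("customers.csv", newline="") as src, \
     open("pii.csv", "w", newline="") as pii_out, \
     open("attributes.csv", "w", newline="") as attr_out:
    reader = csv.DictReader(src)
    attr_columns = [c for c in reader.fieldnames if c not in PII_COLUMNS]

    pii_writer = csv.DictWriter(pii_out, fieldnames=["personid"] + PII_COLUMNS)
    attr_writer = csv.DictWriter(attr_out, fieldnames=attr_columns)
    pii_writer.writeheader()
    attr_writer.writeheader()

    for row in reader:
        # "personid" stays in both files so tokens can be joined back later.
        pii_writer.writerow({c: row.get(c, "") for c in ["personid"] + PII_COLUMNS})
        attr_writer.writerow({c: row[c] for c in attr_columns})
```
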
02

Initial Senate Matching login

  1. Type in / copy the address of your Contributor Node into the browser (https://[host name of your Contributor Node]/dashboard)
    1. Your login credentials are created by your organisation when your Contributor Node is configured.
    2. If you have any issues logging in, please reach out to support@datarepublic.com
  2. The Dashboard is loaded. It shows:
    1. List of databases. At launch, this will show two databases (one for production, and one for testing).
    2. System status. Covers status of the node itself ("is the database working?"), connectivity between the contributor and the Senate Matching network, and status of matcher nodes.


03

Database Management Screen

  1. Select the token database on the left that you want to load your file into (e.g. the Production database)
  2. The database page is loaded, and shows:
    1. A database summary panel, showing
      • Number of tokens (should reflect the number of customers in the database)
      • Download tokens to download the customer ID to token mapping
      • Last updated date (other stats coming soon)
      • Download data template provides a blank CSV which shows the format to use for uploading
    2. Middle panel is for uploading a CSV to update or create new customer records
    3. Right side panel is a history of recent uploads, showing how many records were uploaded and what percentage of records had complete values

04

Tokenization process of PII Data

To prepare your dataset for upload and matching in Senate you will first need to tokenize the PII (email address / phone / nationalid, etc.) for each customer, as no PII is permitted on the platform. The process of tokenization will return a token value for each customer record.

To tokenize your PII data and receive tokens for each customer record, a file containing person IDs and other PII will need to be loaded into the token database via your organization’s Contributor node. The ‘personid’ should uniquely identify the person within the token database (this ID usually comes from your CRM or data lake). Once tokens are generated, they can be appended to your table containing the non-identifiable transactional or demographic information for each customer. Your tokenized dataset therefore does not contain any PII data and can be uploaded to Senate and matched with other tokenized datasets in governed spaces on the platform.

  1. Prepare your data for upload
    1. Click the Download data template button and save the blank CSV (Note: the template lists each field only once and does not include multi-valued columns; this does not affect matching on multi-valued fields)
    2. Use the template to prepare your PII data for upload. The template contains the following columns: ‘personid’, ‘email’, ‘phone’, ‘dpid’, ‘nationalid’, ‘family_name’, and ‘given_name’
    3. When using the template:

      • Do not change the file headers or add any additional columns, as this will cause the upload to fail. The only exception is multi-valued fields: for example, if a person has three email addresses, you would add the columns ‘email:0’, ‘email:1’ and ‘email:2’ (see the example sketch after the notes below)
      • Uppercase characters are automatically converted to lower case, and leading or trailing spaces are removed, once the file is uploaded

    4. Make sure the Upload Schema validation rules (below) are followed

    5. Populate your data in the template and save the CSV
  2. Upload your prepared CSV file of PII records (see details in section below)
  3. A green progress bar will show the file being uploaded and processed
  4. The status panel will show a new entry with today's date. After a few seconds it will update to show how many records were processed
  5. During the tokenization process, a random token is generated for each individual. Tokens are 64-bit integers displayed in a hexadecimal format. PII values are hashed, sliced and distributed to Matcher Nodes
  6. Once the tokenization process is complete, the Database summary panel will update to show today’s date, and the total number of tokens in the database

Notes:

  • There is currently an upper limit of 1 million records per CSV file when using the browser interface
  • Your Contributor Node also supports an API for automating data updates
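
The sketch below shows one way to script step 1, including multi-valued email columns. The record is purely hypothetical, and the use of ‘email:0’, ‘email:1’ and ‘email:2’ in place of the single ‘email’ template column is an assumption based on the note above:

```python
import csv

# Template columns, with numbered email columns for a person who has three
# email addresses (an assumption based on the multi-valued field note above).
fieldnames = ["personid", "email:0", "email:1", "email:2", "phone", "dpid",
              "nationalid", "family_name", "given_name", "frequent_flyer_number"]

# A purely hypothetical customer record, for illustration only.
record = {
    "personid": "11",
    "email:0": "alison@email.com",
    "email:1": "alison.sutton@work.example",
    "email:2": "a.sutton@home.example",
    "phone": "(555) 623-2565",   # non-numeric characters are stripped on upload
    "dpid": "3607",
    "nationalid": "kclnMPjq",
    "family_name": "Sutton",
    "given_name": "Alison",
    "frequent_flyer_number": "Vb8270926047",
}

with open("pii_upload.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow(record)
```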


Upload Schema - validation rules:


  1. personid - String (varchar) - PII: No
    • Can be any string of 100 UTF-8 characters or less
    • Should uniquely identify the person within the token database (this ID usually comes from your CRM or data lake)
    • Once tokens are generated, they can be appended to your table containing the non-identifiable transactional or demographic information for each customer. Your tokenized dataset therefore does not contain any PII and can be uploaded to Senate and matched with other tokenized datasets in governed spaces on the platform
  2. email - String (varchar) - PII: Yes
    • Person's email address
  3. phone - Numeric - PII: Yes
    • Person's phone number
    • If this field contains non-numeric characters such as dashes or brackets, e.g. (850) 623-2565, it is normalized (the special characters are removed) before processing
  4. dpid - Numeric - PII: Yes
    • Unique number allocated to each address maintained in Australia Post’s National Address File. The DPID is the key component of the printed barcode that makes the barcode unique
  5. nationalid - String (varchar) - PII: Yes
    • National ID / Social Security Number: a unique identifier for each citizen of a country
    • The number appears on identity documents issued by many countries
  6. family_name - String (varchar) - PII: Yes
    • Family name of the person
  7. given_name - String (varchar) - PII: Yes
    • Given (first) name of the person
  8. frequent_flyer_number - String (varchar) - PII: Yes
    • Person's frequent flyer number (if one exists)
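
A minimal pre-upload check against these rules could look like the sketch below (the file name pii_upload.csv is an assumption; the checks mirror the 100-character personid limit, the numeric normalization of phone and dpid, and the lower-casing and trimming described above):

```python
import csv
import re

def normalize(field, value):
    # Mirror the upload behaviour described above: trim whitespace, lower-case,
    # and strip non-numeric characters from the numeric fields (phone, dpid).
    value = value.strip().lower()
    if field in ("phone", "dpid"):
        value = re.sub(r"\D", "", value)   # e.g. "(850) 623-2565" -> "8506232565"
    return value

with open("pii_upload.csv", newline="") as f:   # file name is illustrative
    for line_no, row in enumerate(csv.DictReader(f), start=2):
        row = {field: normalize(field, value) for field, value in row.items()}
        # personid must be present and be 100 UTF-8 characters or less
        if not row.get("personid") or len(row["personid"]) > 100:
            print(f"line {line_no}: invalid personid")
```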
05

How to upload the CSV file (middle panel)

  1. Use the middle panel to:
    • Upload a data file: drag files to attach, or browse and select a CSV to create new customer records and tokens.
    • Select the field separator, record separator, and quote character used in your file. Note: any file subsequently loaded to the same database should be in the same format (see the sketch after this list).
  2. Database summary shows:
    • Total number of tokens: this should reflect the number of people in the database.
    • Download tokens button: to download the person ID and token for each person.
    • Last updated: the date of the last file upload.
    • Download data template: provides a blank CSV which shows the format to use for uploading.
  3. Recent data file uploads shows:
    • How many records were uploaded (duplicates will be ignored and removed).

    • The percentage of records with complete values in a valid format.
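
If you generate the upload file programmatically, fixing the separator and quote character explicitly makes it easy to keep every file loaded to the same database in the same format. A minimal sketch (the comma separator, double-quote character and newline record separator are just one reasonable convention, and the data is illustrative):

```python
import csv

rows = [{"personid": "11", "email": "alison@email.com"}]   # illustrative data only

# Keep these settings identical for every file loaded to the same database.
with open("pii_upload.csv", "w", newline="") as out:
    writer = csv.DictWriter(
        out,
        fieldnames=["personid", "email"],
        delimiter=",",             # field separator
        quotechar='"',             # quote character
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",       # record separator
    )
    writer.writeheader()
    writer.writerows(rows)
```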

06

Download tokens & join to attributes

  1. Click the 'Download Tokens' button in the database summary panel (on left hand side)
  2. Save the resulting file when prompted
  3. The file contains two columns:
    • Your original customer ID (called "personid")
    • The token issued in place of the PII values (email, phone, dpid, nationalid, family_name, given_name)

This mapping file is then used to attach the token, and only the token, to your customer attribute data, which is then loaded into Senate.

  • You can do this using the "personid" field, which will match the "personid" values you provided prior to tokenization, using whichever method you prefer - many custodians choose to use SQL, Excel, or a scripting language (see the sketch after the example table below).
  • Your resulting file should drop the "personid" column, as this should not be uploaded to Senate.

Append tokens to your attribute data table - example:

| token | attribute1 | attribute2 | attribute3 |
| | | | |
| | | | |
| | | | |
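
A minimal sketch of the join using pandas (assuming the downloaded mapping file has been saved as tokens.csv and your attribute data as attributes.csv, both keyed on "personid"; the file names are illustrative):

```python
import pandas as pd

# tokens.csv: the file from "Download Tokens" (personid -> token mapping).
# attributes.csv: your non-identifiable attribute data, also keyed on personid.
tokens = pd.read_csv("tokens.csv", dtype=str)
attributes = pd.read_csv("attributes.csv", dtype=str)

# Attach the token to each attribute row, then drop personid so the file
# uploaded to Senate contains tokens only.
tokenised = attributes.merge(tokens, on="personid", how="left")
tokenised = tokenised.drop(columns=["personid"])
tokenised.to_csv("tokenised_attributes.csv", index=False)
```

An equivalent SQL join or Excel lookup on "personid" works just as well; the only requirement is that the final file keeps the token column and drops "personid".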

07

Upload tokenized attribute data (Senate)

  1. Log in to Senate for your region
  2. Upload your data - Navigate to the ‘Manage Data’ screen in Senate. Click on the Files tab to access SFTP details for files larger than 100MB, or use ‘drag and drop’ for files smaller than 100MB. Upload the tokenized attribute file you created in step 06 to a directory of your choice
  3. Create a database - Go to the Databases tab and create a database for your data to reside under. It helps to use a descriptive name which indicates what kind of data you'll be uploading
  4. Create tables - Create a table, Load data into the table, and check the data load was successful
  5. Create a view - This is optional, if you want to share a subset of the table only. For example, you can share a view rather than the whole table in a project
  6. Create a data package - Create a data package to exchange data in a Project or Data Listing via the Data Republic Catalog. You can add the files, tables or views you want to share into a data package. On the Package Edit screen, use the Edit Token Link to tell Senate which token database generated the tokens in this package. This tells Senate Matching which set of PII it will use when matching (see more details in the sections below)
  7.  Create a Contributor term sheet - This is an agreement between your organization and Data Republic. A term sheet must be created for each database and should outline how your database can be shared on the Platform

08

Configuring Data Packages for matching


Once a Data Package has been created, it will need to be configured for matching in Senate. This process involves linking the table and column that contains your tokens to the token database that generated the tokens. The Package containing tokens can then be added to a Listing or a data matching Project on Senate.


To configure a data package for matching:


  1. Select Manage Data from the main menu
  2. Click Packages to view a list of packages created
  3. Click View or Edit Package to configure the contents of the package
  4. Contents of the data package are displayed, including any tables, views or files.
  5. Click Link token database (see more details in the next section)

09

Link Token Database

  1. Select the table or view in your package that has the tokens
  2. Select the name of the column that contains the tokens
  3. Select which token database issued the tokens – if you’re unsure, check with the person in your organization who is responsible for preparing your dataset for matching
  4. Click ‘Save’


10

Package successfully updated

  1. A message appears to let you know the package update was successful. Your package has been configured for matching

  2. The button at the bottom of the Packages screen will also change from ‘Link token database’ to ‘Edit token link’

  3. From the main Packages screen, an ‘M’ icon will also appear next to packages that have been configured for matching

11

Editing the Token Link

If you have linked the tokens in your package to the wrong token database, you can unlink them and re-link them to the correct token database:

  1. Click the ‘Edit token link’ button on the packages screen
  2. Click ‘Unlink’ on the window that appears
  3. Select the correct database that issued the tokens and click ‘Save’