How to generate sample RDF data

Created by Steve Place, Modified on Mon, Feb 5, 2024 at 12:10 PM by Steve Place

You can generate sample RDF data using bsbmtools. (Read more about BSBM, the Berlin SPARQL Benchmark, here.)

Prerequisites

Checkout and build bsbmtools

This section is an unofficial update of the corresponding section of the blazegraph/database repo, which is out of date. Stardog has no affiliation with that repo.

Run the following commands:

svn checkout svn://svn.code.sf.net/p/bsbmtools/code/trunk bsbmtools-code
cd bsbmtools-code
ant

Before running ant, update the build.xml file in bsbmtools-code (the directory the checkout above creates) to change the following lines:

<property name="java.source"      value="1.6"/>
<property name="java.target"      value="1.6"/>

to:

<property name="java.source"      value="1.8"/>
<property name="java.target"      value="1.8"/>

Without this change, ant builds against Java 6. Stardog requires Java 11, so that is most likely the version installed on your system. Changing these values makes ant build against Java 8 instead, which a Java 11 JDK supports.
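If you would rather script the build.xml change than edit the file by hand, the following is a minimal sketch. The function name update_java_level is hypothetical, and it assumes the two property lines look exactly like the ones shown above (a 1.6 value on the java.source and java.target properties). It keeps a backup copy as build.xml.bak.

```shell
# Sketch: bump java.source and java.target from 1.6 to 1.8 in build.xml.
# update_java_level is a hypothetical helper name, not part of bsbmtools.
update_java_level() {
  build_file="${1:-build.xml}"
  # -i.bak edits the file in place and keeps a backup; this form is
  # accepted by both GNU sed and BSD/macOS sed.
  # \1 is the captured text up to the opening quote of the value, so
  # "\11.8" means: group 1 followed by the literal text 1.8.
  sed -i.bak -E \
    's/(<property name="java\.(source|target)"[[:space:]]+value=")1\.6"/\11.8"/' \
    "$build_file"
}

# Usage, from the directory that contains build.xml:
#   update_java_level build.xml
```

Only the java.source and java.target properties are touched; any other property with a value of 1.6 is left alone.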

Generate a dataset

This section follows this section of the blazegraph/database repo.

Run the following commands to generate a (roughly) 100 million triple dataset:

 mkdir td_100m
 ./generate -fc -pc 284826 -fn td_100m/dataset -dir td_100m/td_data
 gzip td_100m/dataset.nt

A 100 million triple dataset generated this way is roughly 24GB. To generate a larger or smaller dataset, scale the number passed to the -pc flag by the corresponding multiple. For example, to generate a 200M triple dataset, double the number used with -pc:

 mkdir td_200m
 ./generate -fc -pc 569652 -fn td_200m/dataset -dir td_200m/td_data
 gzip td_200m/dataset.nt

To generate a 10M triple dataset, divide the number used with -pc by 10:

 mkdir td_10m
 ./generate -fc -pc 28483 -fn td_10m/dataset -dir td_10m/td_data
 gzip td_10m/dataset.nt

The size of the generated file changes by roughly the same multiple (e.g., a 10M triple dataset will be about 2.5GB).
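The scaling rule above can be sketched as a small shell helper. This assumes the triple count scales linearly with -pc, using the ratio of 284826 to roughly 100 million triples given above, so the result is an estimate only. The function names pc_for_triples and generate_dataset are hypothetical. The arithmetic assumes a shell with 64-bit integers, which is typical on 64-bit systems.

```shell
# Sketch: estimate the -pc value for a target triple count.
# Assumes linear scaling from the 284826 -> ~100M triples ratio.
pc_for_triples() {
  target_triples="$1"
  # Integer arithmetic; adding half the divisor rounds to the nearest integer.
  echo $(( (target_triples * 284826 + 50000000) / 100000000 ))
}

# Hypothetical wrapper around the commands shown above: generate and
# compress a dataset of roughly N triples into directory DIR.
# Run it from the directory that contains ./generate.
generate_dataset() {
  dir="$1"
  pc="$(pc_for_triples "$2")"
  mkdir -p "$dir"
  ./generate -fc -pc "$pc" -fn "$dir/dataset" -dir "$dir/td_data"
  gzip "$dir/dataset.nt"
}

# Usage:
#   pc_for_triples 200000000            # prints 569652
#   pc_for_triples 10000000             # prints 28483
#   generate_dataset td_50m 50000000
```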