Getting Started

This article is a quick guide that explains how to work with GraphAr C++. To begin with, please refer to the Building Steps to install GraphAr.

Information Files

GraphAr uses a group of Yaml files to save the meta information for a graph.

Graph information

The graph information file defines the most basic information of a graph includes its name, the root directory path of the data files, the vertex information and edge information files it contains, and the version of GraphAr. For example, the file "ldbc_sample.graph.yml" defines an example graph named "ldbc_sample", which includes one type of vertices ("person") and one type of edges ("person knows person").

name: ldbc_sample
prefix: ./
vertices:
  - person.vertex.yml
edges:
  - person_knows_person.edge.yml
version: gar/v1

Vertex information

Each vertex information file defines a single group of vertices with the same vertex label, e.g., "person" in this case. The vertex chunk size, the relative path for vertex data files and the version of GraphAr are specified. These vertices could have some properties, which are divided into property groups. Each property group has its own file type (CSV, ORC or Parquet) and the prefix of the relative path for its data files, it also lists all properties in this group, with every property contains the name, data type and if it is the primary key.

The file person.vertex.yml located inside the test data contains an example of the vertex information file. In this example, the "person" vertices have two property groups. The first group contains only one property (named "id") and the second group contains three properties ("firstName", "lastName" and "gender").

Edge information

Each edge information file defines a single type of edges with specific labels for the source vertex, destination vertex and the edge, e.g., "person_knows_person" in this case. It defines the meta information such as the edge chunk size, the source vertex chunk size, the destination vertex chunk size, if the edges are directed or not, the relative file path for edge data files, the adjLists and the version of GraphAr. The file person_knows_person.edge.yml located inside the test data contains an example of the edge information file.

In GraphAr format, separate data files are used to store the structure (called adjList) and the properties for edges. The adjList type can be either of unordered_by_source, unordered_by_dest, ordered_by_source or ordered_by_dest. For a specific type of adjList, the meta information includes its file path prefix, the file type, as well as all the property groups attached.

note

It is allowed to store different types of adjLists for a group of edges at the same time.

Data Files

Property data

The vertex properties are stored in vertex property chunks with the chunk size specified by the vertex information file. Different property groups correspond to individual groups of data files. In our example, the property group ("first name", "last name", "gender") for vertex chunk 0 of "person" vertices are stored in ./vertex/person/firstName_lastName_gender/chunk0.

In practice of graph processing, it is common to only query a subset of columns of the properties. Thus, the column-oriented formats like Apache ORC and Apache Parquet are more efficient, which eliminate the need to read columns that are not relevant. We also provide data files in ORC and Parquet for the example graph in the test data.

Similar with vertices, the edge properties are stored in edge property chunks. For each vertex chunk, its associated edges (if the edge type is ordered_by_source or unordered_by_source, the associated edges are those in which the source vertex is in that chunk; otherwise, if the edge type is ordered_by_dest or unordered_by_dest, the associated edges are those in which the destination is in that chunk) are maintained in some edge chunks, with the size of each chunk not exceeding the edge chunk size specified in the edge information file.

For instance, the file ./edge/person_knows_person/ordered_by_source/creationDate/part0/chunk0 stores the property group "creationDate" of "person_knows_person" edges for the first edge chunk of the first vertex chunk, and the adjList type of the edges is ordered_by_source.

AdjList data

The adjList in GraphAr describes the topology structure, i.e., the internal id of the source vertex and the destination vertex for each of a group of edges. As explained in Edges in GraphAr, the edges are separated into edge chunks, and each edge chunk has its own adjList table and 0 or more property tables.

For example, the file ./edge/person_knows_person/ordered_by_source/adj_list/part0/chunk0 saves the adjList of "person_knows_person" edges for the first edge chunk of the first vertex chunk, and the adjList type of the edges is "ordered_by_source". This adjList table consists of only two columns: one for the source and one for the destination; it can be saved in CSV, ORC, or Parquet files.

note

If the edges are ordered, there may also be offset chunks to construct the index for accessing edges of a single vertex. These chunks will tore the start offset of each vertex's edges, see ./edge/person_knows_person/ordered_by_source/offset/chunk0 as an example.

How to Use GraphAr

Construct information

It is convenient to construct the graphar metadata and dump it to generate information files. We provide an example program located in the source code which shows how to construct and dump the files for graph information, vertex information and edge information.

Also, the metadata of a graph can be constructed easily through reading the already existed information files, as the following code illustrates:

// construct graph information from file
std::string path = ... // the path of the graph information file (e.g., ldbc_sample.graph.yml)
auto graph_info = graphar::GraphInfo::Load(path).value();

// get vertex information
auto vertex_info = graph_info->GetVertexInfo("person");
if (vertex_info != nullptr) {
  // use vertex_info ...
}

// get edge information
auto edge_info = graph_info->GetEdgeInfo("person", "knows", "person");
if (edge_info != nullptr) {
  // use edge_info ...
}

Read GraphAr format files

GraphAr supports the flexible reading of graph data, e.g., allowing to read data of a single vertex, a vertex chunk, or all vertices with a specific label. In addition, necessary property groups can be selected to read and avoid reading all properties from the files. Furthermore, GraphAr provides convenient and flexible access to adjList, offset and property chunks for edges.

As a simple case, the following example shows how to read all vertices with label "person" of the graph defined by "graph_info" and output the values of "id" and "firstName" for each vertex.

graph_info = ...
auto vertices = graphar::VerticesCollection::Make(graph_info, "person").value();

for (auto it = vertices->begin(); it != vertices->end(); ++it) {
  // get a vertex and access its data
  auto vertex = *it;
  if (vertex.IsValid("id") && vertex.IsValid("firstName"))
    std::cout << "id=" << vertex.property<int64_t>("id").value() << ", firstName=" << vertex.property<std::string>("firstName").value() << std::endl;
}

The next example reads all edges with label "person_knows_person" from the above graph and outputs the end vertices for each edge.

graph_info = ...
auto expect = graphar::EdgesCollection::Make(graph_info, "person", "knows", "person", graphar::AdjListType::ordered_by_source);
auto edges = expect.value();

for (auto it = edges->begin(); it != edges->end(); ++it) {
  // get an edge and access its data
  auto edge = *it;
  std::cout << "src=" << edge.source() << ", dst=" << edge.destination() << std::endl;
}

Write GraphAr format files

As same with the readers, the GraphAr writers provide different-level methods to output the graph data in memory into GraphAr format files.

As the simplest cases, the fist example below adds vertices to VerticesBuilder, and then dumps the data to files; the second example constructs a collection of edges and then dumps them.

vertex_info = ...
prefix = ...
graphar::builder::VerticesBuilder builder(vertex_info,  prefix);

// add a vertex
graphar::builder::Vertex v;
v.AddProperty("id", 933);
v.AddProperty("firstName", "Alice");
builder.AddVertex(v);
// add other vertices
// ...

// write to GraphAr format files
builder.Dump();

edge_info = ...
prefix = ...
vertices_num = ...
graphar::builder::EdgesBuilder builder(edge_info, prefix, graphar::AdjListType::ordered_by_source, vertices_num);

// add an edge (0 -> 3)
graphar::builder::Edge e(0, 3);
e.AddProperty("creationDate", "2011-07-20T20:02:04.233+0000");
builder.AddEdge(e);
// add other edges
// ...

// write to GraphAr format files
builder.Dump();

A PageRank Example

Here we will go through an example of out-of-core graph analytic algorithms based on GraphAr which calculates the PageRank. Please look here if you want a detailed explanation of the PageRank algorithm. And the source code can be found at pagerank_example.cc.

This program first reads in the graph information file to obtain the metadata; then, it constructs the vertex and edge collections to enable access to the graph. After that, an implementation of the PageRank algorithm is provided, with data for the vertices stored in memory, and the edges streamed through disk I/O. Finally, the vertex information with type "person" is extended to include a new property named "pagerank" (a new vertex information file named person-new-pagerank.vertex.yml is saved) and the VerticesBuilder is used to write the results to new generated data chunks.

Please refer to more examples to learn about the other available case studies utilizing GraphAr.

Processing Graph Data with Labels

As GraphAr supports LPG data, users can add labels for vertices and use related label filtering functions to obtain specified vertices. The standard csv format of graph data with labels supported by GraphAr is id|:LABEL|property_1|property_2, if a vertex has multiple labels, use ; to separate them. Here is an example.

id|:LABEL|name|url
0|company;public|Kam_Air|http://dbpedia.org/resource/Kam_Air
1|company|Balkh_Airlines|http://dbpedia.org/resource/Balkh_Airlines
2|company|Khyber_Afghan_Airlines|http://dbpedia.org/resource/Khyber_Afghan_Airlines
...
1575|university|Paktia_University|http://dbpedia.org/resource/Paktia_University

At the beginning, we need to add label information into the yml file of vertex. For this example, it should be as follows.

# organisation.vertex.yml

type: organisation
chunk_size: 4096
prefix: vertex/organisation/
# add the label infos
labels:
  - university
  - company
  - public
property_groups:
  - file_type: parquet
    properties:
      - name: name
        data_type: string
        is_primary: false
      - name: url
        data_type: string
        is_primary: false
      - name: id
        data_type: int64
        is_primary: true
version: gar/v1

When you have the data ready, you can read the file into arrow::Table by using arrow I/O function.

    arrow::csv::ReadOptions read_options{}; 
    arrow::csv::ParseOptions parse_options{}; 
    arrow::csv::ConvertOptions convert_options{};

    parse_options.delimiter = '|'; 

    auto input = arrow::io::ReadableFile::Open(test_data_dir + "/ldbc/organisation_0_0.csv", arrow::default_memory_pool()).ValueOrDie();

    auto reader = arrow::csv::TableReader::Make(
        arrow::io::default_io_context(),
        input,
        read_options,
        parse_options,
        convert_options).ValueOrDie();

    std::shared_ptr<arrow::Table> table;
    table = reader->Read().ValueOrDie();

You can export label table to disk in parquet format, and read it back into memory in the following way.

  auto options_parquet_builder = WriterOptions::ParquetOptionBuilder();
  options_parquet_builder.compression(arrow::Compression::type::UNCOMPRESSED);
  // write arrow table as parquet chunk
  auto maybe_writer =
      VertexPropertyWriter::Make(vertex_info, test_data_dir + "/ldbc/parquet/", options_parquet_builder.build());
  REQUIRE(!maybe_writer.has_error());
  auto writer = maybe_writer.value();
  REQUIRE(writer->WriteTable(table, 0).ok());
  REQUIRE(writer->WriteVerticesNum(table->num_rows()).ok());

  // read parquet chunk as arrow table
  auto maybe_reader =
      VertexPropertyArrowChunkReader::MakeForLabels(graph_info, "organisation", labels, SelectType::LABELS);
  assert(maybe_reader.status().ok());
  auto reader = maybe_reader.value();
  assert(reader->seek(0).ok());
  assert(reader->GetLabelChunk().status().ok());
  assert(reader->next_chunk().ok());

Using Label Filtering Functions

By calling the graphar::VerticesCollection::verticesWithLabel or graphar::VerticesCollection::verticesWithMultipleLabels API, we can specify a certain type of vertices on a certain graph, then filter out all vertices that match one or more labels. Here we introduce several examples of using label filtering.

  graph_info = ...
  auto vertex_info = graph_info->GetVertexInfo("organisation");
  auto labels = vertex_info->GetLabels();

  // query vertices with a specific label
  auto maybe_filter_vertices_collection =
      graphar::VerticesCollection::verticesWithLabel(std::string("company"), graph_info, type);
  ASSERT(!maybe_filter_vertices_collection.has_error());
  auto filter_vertices = maybe_filter_vertices_collection.value();
  
  // iterate vertices with label "company"
  for (auto it = filter_vertices->begin(); it != filter_vertices->end();
       ++it) {
    // get a node's all labels
    auto label_result = it.label();
    std::cout << "id: " << it.id() << " ";
    if (!label_result.has_error()) {
      for (auto label : label_result.value()) {
        std::cout << label << " ";
      }
    }
    // ...
  }

  // query vertices based on a query result
  auto maybe_filter_vertices_collection =
      graphar::VerticesCollection::verticesWithLabel(std::string("public"),
                                                     filter_vertices);

  // query vertices with multi labels
  auto maybe_filter_vertices_collection =
    graphar::VerticesCollection::verticesWithMultipleLabels({"company", "public"}, graph_info, type);
  // ...

 

Notice that, if the first two queries are executed successively, the result is equivalent to the third query.

Working with Cloud Storage (S3, OSS)

GraphAr supports reading and writing data from and to cloud storage, including AWS S3 and Alibaba Cloud OSS.

To read data from cloud storage, you can specify the path of the data files with URI schema, e.g., "s3://bucket-name/path/to/data" or "s3://[access-key:secret-key]@bucket-name/path/to/data".

Code example demonstrates how to read data from S3.

Note that once you use cloud storage, you need to call graphar::InitializeS3 to initialize S3 APIs before starting the work and callgraphar::FinalizeS3() to shut down the APIs after the work finish.

Getting Started

Information Files​

Graph information​

Vertex information​

Edge information​

Data Files​

Property data​

AdjList data​

How to Use GraphAr​

Construct information​

Read GraphAr format files​

Write GraphAr format files​

A PageRank Example​

Processing Graph Data with Labels​

Using Label Filtering Functions​

Working with Cloud Storage (S3, OSS)​