9 changes: 7 additions & 2 deletions pages/data-migration.mdx
@@ -15,7 +15,7 @@ instance. Whether your data is structured in files, relational databases, or
other graph databases, Memgraph provides the flexibility to integrate and
analyze your data efficiently.

Memgraph supports file system imports like CSV files, offering efficient and
Memgraph supports file system imports like Parquet and CSV files, offering efficient and
structured data ingestion. **However, if you want to migrate directly from
another data source, you can use the [`migrate`
module](/advanced-algorithms/available-algorithms/migrate)** from Memgraph MAGE
@@ -31,6 +31,11 @@ In order to learn all the pre-requisites for importing data into Memgraph, check

## File types

### Parquet files

Parquet files can be imported efficiently from the local disk and from S3-compatible storage (`s3://` URIs) using the
[`LOAD PARQUET` clause](/querying/clauses/load-parquet).

### CSV files

CSV files provide a simple and efficient way to import tabular data into Memgraph
@@ -262,4 +267,4 @@ nonsense or sales pitch, just tech.
/>
</Cards>

<CommunityLinks/>
<CommunityLinks/>
1 change: 1 addition & 0 deletions pages/data-migration/_meta.ts
@@ -1,6 +1,7 @@
export default {
"best-practices": "Best practices",
"csv": "CSV",
"parquet": "PARQUET",
"json": "JSON",
"cypherl": "CYPHERL",
"migrate-from-neo4j": "Migrate from Neo4j",
2 changes: 1 addition & 1 deletion pages/data-migration/best-practices.mdx
@@ -572,4 +572,4 @@ For more information about `Delta` objects, check the
information on the [IN_MEMORY_TRANSACTIONAL storage mode](/fundamentals/storage-memory-usage#in-memory-transactional-storage-mode-default).


<CommunityLinks/>
<CommunityLinks/>
252 changes: 252 additions & 0 deletions pages/data-migration/parquet.mdx
@@ -0,0 +1,252 @@
---
title: Import data from Parquet files
description: Leverage Parquet files in Memgraph operations. Our detailed guide simplifies the process for an enhanced graph computing journey.
---

import { Callout } from 'nextra/components'
import { Steps } from 'nextra/components'
import { Tabs } from 'nextra/components'

# Import data from Parquet files

Data from Parquet files can be imported from the local disk or from S3-compatible storage using the
[`LOAD PARQUET` Cypher clause](#load-parquet-cypher-clause).

## `LOAD PARQUET` Cypher clause

The `LOAD PARQUET` clause uses a background thread that reads column batches, assembles batches of 64K rows, and puts them on a queue from
which the main thread pulls the data. The main thread then reads the queue row by row, binds the contents of each parsed row to the
specified variable, and populates the database if it is empty or appends new data to an existing dataset.

### `LOAD PARQUET` clause syntax


The syntax of the `LOAD PARQUET` clause is:

```cypher
LOAD PARQUET FROM <parquet-location> ( WITH CONFIG configs=configMap )? AS <variable-name>
```

- `<parquet-location>` is a string with the location of the Parquet file.<br/> Without an
  `s3://` prefix, it refers to a path on the local disk; with the `s3://` prefix, it pulls the file at the specified URI from S3-compatible storage.
  There are no restrictions on where in
  your file system the file can be located, as long as the path is valid (i.e.,
  the file exists). If you are using Docker to run Memgraph, you will need to
  [copy the files from your local directory into the Docker
  container](/getting-started/first-steps-with-docker#copy-files-from-and-to-a-docker-container)
  so that Memgraph can access them. <br/>

- `<configs>` is an optional configuration map through which you can specify the following options (see the example after this list): `aws_region`, `aws_access_key`, `aws_secret_key` and `aws_endpoint_url`.
  - `<aws_region>`: The region in which your S3 service is located.
  - `<aws_access_key>`: The access key used to connect to the S3 service.
  - `<aws_secret_key>`: The secret key used to connect to the S3 service.
  - `<aws_endpoint_url>`: Optional; sets the URL of the S3-compatible storage.
- `<variable-name>` is a symbolic name representing the variable to which the
  contents of the parsed row will be bound, enabling access to the row
  contents later in the query. The variable doesn't have to be used in any
  subsequent clause.
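
A minimal sketch of `WITH CONFIG` in use; the bucket name, file, and credentials below are placeholders, not working values:

```cypher
LOAD PARQUET FROM "s3://my-bucket/people.parquet"
WITH CONFIG configs={aws_region: "eu-west-1", aws_access_key: "<access-key>", aws_secret_key: "<secret-key>"}
AS row
CREATE (p:Person {id: row.id, name: row.name});
```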

### `LOAD PARQUET` clause specificities

When using the `LOAD PARQUET` clause please keep in mind:

- The parser parses values into their appropriate types, so you should get the same type as in the Parquet file. The types `BOOL`, `INT8`, `INT16`, `INT32`, `INT64`, `UINT8`, `UINT16`, `UINT32`, `UINT64`,
`HALF_FLOAT`, `FLOAT`, `DOUBLE`, `STRING`, `LARGE_STRING`, `STRING_VIEW`, `DATE32`, `DATE64`, `TIME32`, `TIME64`, `TIMESTAMP`, `DURATION`, `DECIMAL128`, `DECIMAL256`, `BINARY`, `LARGE_BINARY`, `FIXED_SIZE_BINARY`,
`LIST` and `MAP` are supported. Unsupported types will be saved as strings in Memgraph.
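
  Because values keep their Parquet types, numeric columns can be used directly, without `toInteger`-style conversions. A sketch, assuming a hypothetical `/people.parquet` file with an integer `age` column:

  ```cypher
  LOAD PARQUET FROM "/people.parquet" AS row
  RETURN row.name AS name, row.age + 1 AS age_next_birthday;
  ```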

- Authentication parameters (`aws_region`, `aws_access_key`, `aws_secret_key` and `aws_endpoint_url`) can be provided in the `LOAD PARQUET` query using the `WITH CONFIG` construct, through environment variables
(`AWS_REGION`, `AWS_ACCESS_KEY`, `AWS_SECRET_KEY` and `AWS_ENDPOINT_URL`), or through run-time database settings. To set authentication parameters through run-time settings, use the `SET DATABASE SETTING <key> TO <value>;`
query. The keys of these authentication parameters are `aws.region`, `aws.access_key`, `aws.secret_key` and `aws.endpoint_url`.
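
  For example, to set them through run-time settings (the values below are placeholders):

  ```cypher
  SET DATABASE SETTING "aws.region" TO "eu-west-1";
  SET DATABASE SETTING "aws.access_key" TO "<your-access-key>";
  SET DATABASE SETTING "aws.secret_key" TO "<your-secret-key>";
  ```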

- **The `LOAD PARQUET` clause is not a standalone clause**, meaning a valid query
must contain at least one more clause, for example:

```cypher
LOAD PARQUET FROM "/people.parquet" AS row
CREATE (p:People) SET p += row;
```

In this regard, the following query will throw an exception:

```cypher
LOAD PARQUET FROM "/file.parquet" AS row;
```

**Adding a `MATCH` or `MERGE` clause before `LOAD PARQUET`** allows you to match certain
entities in the graph before running `LOAD PARQUET`, optimizing the process as
matched entities do not need to be searched for every row in the Parquet file.

However, the `MATCH` or `MERGE` clause can be used prior to the `LOAD PARQUET` clause only
if it returns a single row. Returning multiple rows before calling the
`LOAD PARQUET` clause will cause a Memgraph runtime error.
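
A sketch of this pattern, assuming a hypothetical `/orders.parquet` file and a single matching `:Country` node already in the graph:

```cypher
MATCH (c:Country {name: "Germany"})  // must return exactly one row
LOAD PARQUET FROM "/orders.parquet" AS row
CREATE (o:Order {id: row.id})-[:PLACED_IN]->(c);
```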

- **The `LOAD PARQUET` clause can be used at most once per query**, so queries like
the one below will throw an exception:

```cypher
LOAD PARQUET FROM "/x.parquet" AS x
LOAD PARQUET FROM "/y.parquet" AS y
CREATE (n:A {p1 : x, p2 : y});
```

### Increase import speed

The `LOAD PARQUET` clause will create relationships much faster and consequently
speed up data import if you [create indexes](/fundamentals/indexes) on nodes or
node properties once you import them:

```cypher
CREATE INDEX ON :Node(id);
```

If the `LOAD PARQUET` clause is merging data instead of creating it, create indexes
before running the `LOAD PARQUET` clause.


The `USING PERIODIC COMMIT <BATCH_SIZE>` construct also improves import speed because
it optimizes some of the memory allocation patterns. In our benchmarks, this construct
speeds up execution by 25% to 35%.

```cypher
USING PERIODIC COMMIT 1024 LOAD PARQUET FROM "/x.parquet" AS x
CREATE (n:A {p1: x.p1, p2: x.p2});
```


You can also speed up import if you switch Memgraph to [**analytical storage
mode**](/fundamentals/storage-memory-usage#storage-modes). In the analytical
storage mode there are no ACID guarantees besides manually created snapshots.
After import you can switch the storage mode back to
transactional and enable ACID guarantees.

You can switch between modes within the session using the following query:

```cypher
STORAGE MODE IN_MEMORY_{TRANSACTIONAL|ANALYTICAL};
```

If you use `IN_MEMORY_ANALYTICAL` mode and have nodes and relationships stored in
separate Parquet files, you can run multiple concurrent `LOAD PARQUET` queries to import data even faster.
To achieve the best import performance, split your nodes and relationships
files into smaller files and run multiple `LOAD PARQUET` queries in parallel.
The key is to run all `LOAD PARQUET` queries that create nodes first. After that, run
all `LOAD PARQUET` queries that create relationships.
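
For example, assuming the nodes are split into hypothetical `nodes_0.parquet` and `nodes_1.parquet` files, each of the following queries can run concurrently from its own client session:

```cypher
// session 1
LOAD PARQUET FROM "/nodes_0.parquet" AS row
CREATE (n:Node {id: row.id});
```

```cypher
// session 2, run at the same time
LOAD PARQUET FROM "/nodes_1.parquet" AS row
CREATE (n:Node {id: row.id});
```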


### Import multiple Parquet files with distinct graph objects

In this example, the data is split across four files; each file contains nodes
of a single label or relationships of a single type.

<Steps>

{<h3 className="custom-header">Download the files</h3>}

- [`people_nodes.parquet`](https://public-assets.memgraph.com/import-data/load-csv-cypher/multiple-types-nodes/people_nodes.parquet) is used to create nodes labeled `:Person`.<br/> The file contains the following data:
```parquet
id,name,age,city
100,Daniel,30,London
101,Alex,15,Paris
102,Sarah,17,London
103,Mia,25,Zagreb
104,Lucy,21,Paris
```
- [`restaurants_nodes.parquet`](https://public-assets.memgraph.com/import-data/load-csv-cypher/multiple-types-nodes/restaurants_nodes.parquet) is used to create nodes labeled `:Restaurant`.<br/> The file contains the following data:
```parquet
id,name,menu
200,Mc Donalds,Fries;BigMac;McChicken;Apple Pie
201,KFC,Fried Chicken;Fries;Chicken Bucket
202,Subway,Ham Sandwich;Turkey Sandwich;Foot-long
203,Dominos,Pepperoni Pizza;Double Dish Pizza;Cheese filled Crust
```

- [`people_relationships.parquet`](https://public-assets.memgraph.com/import-data/load-csv-cypher/multiple-types-nodes/people_relationships.parquet) is used to connect people with the `:IS_FRIENDS_WITH` relationship.<br/> The file contains the following data:
```parquet
first_person,second_person,met_in
100,102,2014
103,101,2021
102,103,2005
101,104,2005
104,100,2018
101,102,2017
100,103,2001
```
- [`restaurants_relationships.parquet`](https://public-assets.memgraph.com/import-data/load-csv-cypher/multiple-types-nodes/restaurants_relationships.parquet) is used to connect people with restaurants using the `:ATE_AT` relationship.<br/> The file contains the following data:
```parquet
PERSON_ID,REST_ID,liked
100,200,true
103,201,false
104,200,true
101,202,false
101,203,false
101,200,true
102,201,true
```

{<h3 className="custom-header">Check the location of the Parquet files</h3>}
If you are working with Docker, [copy the files from your local directory into
the Docker container](/getting-started/first-steps-with-docker#copy-files-from-and-to-a-docker-container)
so that Memgraph can access them.

{<h3 className="custom-header">Import nodes</h3>}

Each row will be parsed as a map, and the
fields can be accessed using the property lookup syntax (e.g. `id: row.id`).

The following query will load row by row from the file, and create a new node
for each row with properties based on the parsed row values:

```cypher
LOAD PARQUET FROM "/path-to/people_nodes_wh.parquet" AS row
CREATE (n:Person {id: row.id, name: row.name, age: row.age, city: row.city});
```

In the same manner, the following query will create new nodes for each restaurant:

```cypher
LOAD PARQUET FROM "/path-to/restaurants_nodes.parquet" AS row
CREATE (n:Restaurant {id: row.id, name: row.name, menu: row.menu});
```

{<h3 className="custom-header">Create indexes</h3>}

Creating an [index](/fundamentals/indexes) on a property used to connect nodes
with relationships, in this case, the `id` property of the `:Person` nodes,
will speed up the import of relationships, especially with large datasets:

```cypher
CREATE INDEX ON :Person(id);
```

{<h3 className="custom-header">Import relationships</h3>}
The following query will create relationships between the people nodes:

```cypher
LOAD PARQUET FROM "/path-to/people_relationships.parquet" AS row
MATCH (p1:Person {id: row.first_person})
MATCH (p2:Person {id: row.second_person})
CREATE (p1)-[f:IS_FRIENDS_WITH]->(p2)
SET f.met_in = row.met_in;
```

The following query will create relationships between people and restaurants where they ate:

```cypher
LOAD PARQUET FROM "/path-to/restaurants_relationships.parquet" AS row
MATCH (p1:Person {id: row.PERSON_ID})
MATCH (re:Restaurant {id: row.REST_ID})
CREATE (p1)-[ate:ATE_AT]->(re)
SET ate.liked = toBoolean(row.liked);
```

{<h3 className="custom-header">Final result</h3>}
Run the following query to see how the imported data looks as a graph:

```cypher
MATCH p=()-[]-() RETURN p;
```

![](/pages/data-migration/csv/load_csv_restaurants_relationships.png)

</Steps>
@@ -159,7 +159,7 @@ of the following commands:
| Privilege to enforce [constraints](/fundamentals/constraints). | `CONSTRAINT` |
| Privilege to [dump the database](/configuration/data-durability-and-backup#database-dump).| `DUMP` |
| Privilege to use [replication](/clustering/replication) queries. | `REPLICATION` |
| Privilege to access files in queries, for example, when using `LOAD CSV` clause. | `READ_FILE` |
| Privilege to access files in queries, for example, when using `LOAD CSV` and `LOAD PARQUET` clauses. | `READ_FILE` |
| Privilege to manage [durability files](/configuration/data-durability-and-backup#database-dump). | `DURABILITY` |
| Privilege to try and [free memory](/fundamentals/storage-memory-usage#deallocating-memory). | `FREE_MEMORY` |
| Privilege to use [trigger queries](/fundamentals/triggers). | `TRIGGER` |
17 changes: 17 additions & 0 deletions pages/database-management/configuration.mdx
@@ -318,6 +318,10 @@ fallback to the value of the command-line argument.
| hops_limit_partial_results | If set to `true`, partial results are returned when the hops limit is reached. If set to `false`, an exception is thrown when the hops limit is reached. The default value is `true`. | yes |
| timezone | IANA timezone identifier string setting the instance's timezone. | yes |
| storage.snapshot.interval | Define periodic snapshot schedule via cron expression ([crontab](https://crontab.guru/) format, an [Enterprise feature](/database-management/enabling-memgraph-enterprise)) or as a period in seconds. Set to empty string to disable. | no |
| aws.region | AWS region in which your S3 service is located. | yes |
| aws.access_key | Access key used to read the file from S3. | yes |
| aws.secret_key | Secret key used to read the file from S3. | yes |
| aws.endpoint_url | URL on which S3 can be accessed (if using some other S3-compatible storage). | yes |
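
For example, a sketch of changing one of these settings at run time (the URL below is a placeholder for a local S3-compatible service):

```cypher
SET DATABASE SETTING "aws.endpoint_url" TO "http://localhost:9000";
```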

All settings can be fetched by calling the following query:

@@ -481,6 +485,19 @@ connections in Memgraph.
| `--stream-transaction-retry-interval=500` | The interval to wait (measured in milliseconds) before retrying to execute again a conflicting transaction. | `[uint32]` |


### AWS

This section contains the list of flags that are used when connecting to S3-compatible storage.


| Flag | Description | Type |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------|------------|
| `--aws-region` | AWS region in which your S3 service is located. | `[string]` |
| `--aws-access-key`                          | Access key used to read the file from S3.                                                                     | `[string]` |
| `--aws-secret-key`                          | Secret key used to read the file from S3.                                                                     | `[string]` |
| `--aws-endpoint-url` | URL on which S3 can be accessed (if using some other S3-compatible storage). | `[string]` |


### Other

This section contains the list of all other relevant flags used within Memgraph.
10 changes: 5 additions & 5 deletions pages/help-center/faq.mdx
@@ -212,11 +212,11 @@ us](https://memgraph.com/enterprise-trial) for more information.

### What is the fastest way to import data into Memgraph?

Currently, the fastest way to import data is from a CSV file with a [LOAD CSV
clause](/data-migration/csv). Check out the [best practices for importing
Currently, the fastest way to import data is from a Parquet file with a [LOAD PARQUET
clause](/data-migration/parquet). Check out the [best practices for importing
data](/data-migration/best-practices).

[Other import methods](/data-migration) include importing data from JSON and CYPHERL files,
[Other import methods](/data-migration) include importing data from CSV, JSON and CYPHERL files,
migrating from relational databases, or connecting to a data stream.

### How to import data from MySQL or PostgreSQL?
@@ -226,11 +226,11 @@ You can migrate from [MySQL](/data-migration/migrate-from-rdbms) or

### What file formats does Memgraph support for import?

You can import data from [CSV](/data-migration/csv),
You can import data from [CSV](/data-migration/csv), [PARQUET](/data-migration/parquet),
[JSON](/data-migration/json) or [CYPHERL](/data-migration/cypherl) files.

CSV files can be imported in on-premise instances using the [LOAD CSV
clause](/data-migration/csv), and JSON files can be imported using a
clause](/data-migration/csv), PARQUET files can be imported using the [LOAD PARQUET clause](/data-migration/parquet), and JSON files can be imported using a
[json_util](/advanced-algorithms/available-algorithms/json_util) module from the
MAGE library. On a Cloud instance, data from CSV and JSON files can be imported only
from a remote address.
6 changes: 5 additions & 1 deletion pages/index.mdx
@@ -165,6 +165,10 @@ JSON files, and import data using queries within a CYPHERL file.
title="JSON"
href="/data-migration/json"
/>
<Cards.Card
title="PARQUET"
href="/data-migration/parquet"
/>
<Cards.Card
title="CYPHERL"
href="/data-migration/cypherl"
@@ -337,4 +341,4 @@ Ensure alignment with the latest updates and changes.
/>
</Cards>

<CommunityLinks/>
<CommunityLinks/>