

# **PMDK ESSENTIALS**

Andy Rudoff (Intel Data Center Group) September 5<sup>th</sup>, 2019



- Persistent Memory Concepts
- Operating System Essentials
- The PMDK Libraries
- Flushing, Transactions, Allocation
- Language Support
- Comparing High and Low Level Languages





# PERSISTENT MEMORY CONCEPTS

# THE STORAGE STACK (50,000FT VIEW...)





# A Programmer's View

(not just C programmers!)

```
fd = open("/my/file", O_RDWR);
...
count = read(fd, buf, bufsize);
...
count = write(fd, buf, bufsize);
...
close(fd);
```

"Buffer-Based"



## A Programmer's View (mapped files)

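```
fd = open("/my/file", O_RDWR);
...
/* filesize: length of the mapping, obtained elsewhere */
base = mmap(NULL, filesize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
...
base[100] = 'X';
strcpy(base, "hello there");
*structp = *base_structp;
...
```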

"Load/Store"
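A minimal, self-contained sketch of the load/store view, assuming a POSIX system. The names `map_file`, `save_record`, `load_record`, `struct record` and `MAP_LEN` are hypothetical helpers for illustration, not part of the slides or of PMDK. A record is written with a plain assignment through one mapping and read back through a fresh one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN 4096            /* hypothetical region size for the sketch */

struct record {
    int id;
    char name[32];
};

/* Map MAP_LEN bytes of `path` (created if needed); NULL on failure. */
static void *map_file(const char *path, int prot)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    void *base = MAP_FAILED;
    if (ftruncate(fd, MAP_LEN) == 0)
        base = mmap(NULL, MAP_LEN, prot, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return base == MAP_FAILED ? NULL : base;
}

/* Store a record with an ordinary assignment, then flush it. */
int save_record(const char *path, const struct record *r)
{
    assert(sizeof(*r) <= MAP_LEN);
    struct record *base = map_file(path, PROT_READ | PROT_WRITE);
    if (base == NULL)
        return -1;
    *base = *r;                 /* a store, not a write() call */
    int rc = msync(base, MAP_LEN, MS_SYNC);
    munmap(base, MAP_LEN);
    return rc;
}

/* Load the record back through a fresh mapping. */
int load_record(const char *path, struct record *out)
{
    const struct record *base = map_file(path, PROT_READ);
    if (base == NULL)
        return -1;
    *out = *base;               /* a load, not a read() call */
    munmap((void *)base, MAP_LEN);
    return 0;
}
```

Calling `save_record()` and then `load_record()` on the same scratch path should hand back the same record, with no `read()` or `write()` in the data path.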




## **MEMORY-MAPPED FILES**

What are memory-mapped files really?

- Direct access to the page cache
- Storage only supports block access (paging)

With load/store access, when does I/O happen?

- Read faults/Write faults
- Flush to persistence

Not that commonly used or understood

- Quite powerful
- Sometimes used without realizing it

Good reference: http://nommu.org/memory-
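To make "flush to persistence" concrete, here is a hedged sketch for an ordinary (non-pmem) file mapping on a POSIX system. `flush_range` and `demo_flush` are hypothetical names. The one detail worth showing is that `msync()` expects a page-aligned start address, so an arbitrary range has to be widened to page boundaries first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Flush [addr, addr + len) of a MAP_SHARED file mapping. msync() requires
 * a page-aligned start, so round the start down to its page boundary.
 */
int flush_range(const void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    size_t span = len + ((uintptr_t)addr - start);
    return msync((void *)start, span, MS_SYNC);
}

/*
 * Store one byte in the second page of a scratch file, flush just that
 * byte's page, and check that read-style I/O sees it. Returns 0 on success.
 */
int demo_flush(const char *path)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = 2 * page;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return -1;
    }

    base[page + 10] = 'X';      /* first touch faults the page in */
    int rc = flush_range(&base[page + 10], 1);

    char c = 0;
    if (rc == 0 && (pread(fd, &c, 1, (off_t)(page + 10)) != 1 || c != 'X'))
        rc = -1;

    munmap(base, len);
    close(fd);
    return rc;
}
```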



## **OS PAGING**





### NVDIMM-N










#### SPDK, PMDK & Vtune<sup>™</sup> Summit



#### MOTIVATION FOR THE PM PROGRAMMING MODEL?

#### Idle Average Random Read Latency<sup>1</sup>



<sup>1</sup> Source – Intel-tested: Average read latency measured at queue depth 1 during 4k random write workload. Measured using FIO 3.1. Common Configuration – Intel 2U Server System, OS CentOS 7.5, kernel 4.17.6-1.el7.x86\_64, CPU 2 x Intel® Xeon® 6154 Gold @ 3.0GHz (18 cores), RAM 256GB DDR4 @ 2666MHz. Configuration – Intel® Optane<sup>™</sup> SSD DC P4800X 375GB and Intel® SSD DC P4600 1.6TB. Latency – Average read latency measured at QD1 during 4K Random Write operations using FIO 3.1. Intel Microcode: 0x2000043; System BIOS: 00.01.0013; ME Firmware: 04.00.04.294; BMC Firmware: 1.43.91f76955; FRUSDR: 1.43. SSDs tested were commercially available at time of test. The benchmark results may need to be revised as additional testing is conducted. Performance results are based on testing as of July 24, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.









# **THE VALUE OF PERSISTENT MEMORY**

- Data sets addressable with no DRAM footprint
  - At least, it is up to the application whether data is copied to DRAM
- Typically DMA (and RDMA) to PM works as expected
  - RDMA directly to persistence, no buffer copy required!
- The "Warm Cache" effect
  - No time spent loading up memory
- Byte addressable
- Direct user-mode access
  - No kernel code in data path



#### THE SNIA NVM PROGRAMMING MODEL





#### THE PROGRAMMING MODEL BUILDS ON THE STORAGE APIS










#### **OPTIMIZED FLUSH IS THE PRIMARY NEW API**





## **APPLICATION MEMORY ALLOCATION**



- Well-worn interface, around for decades
- Memory is gone when application exits
  - Or machine goes down



## **APPLICATION NVM ALLOCATION**



- Simple, familiar interface, but then what?
  - Persistent, so apps want to "attach" to regions
  - Need to manage permissions for regions
  - Need to resize, remove, ..., backup the data



## **VISIBILITY VERSUS PERSISTENCE**

It has always been thus:

- open()
- mmap()
- store...



pmem just follows this decades-old model

But the stores are cached in a different spot
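The distinction can be sketched with two shared mappings of one ordinary file, assuming a POSIX system. The function name `visibility_then_persistence` is hypothetical. A store through one mapping is visible through the other straight away, but it is only guaranteed to be persistent once the flush has returned.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN 4096

/*
 * Returns 0 if a store made through one mapping is visible through a second
 * mapping before any flush, and the following msync() succeeds.
 */
int visibility_then_persistence(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, LEN) != 0) {
        close(fd);
        return -1;
    }

    char *writer = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    const volatile char *reader = mmap(NULL, LEN, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);

    int rc = -1;
    if (writer != MAP_FAILED && reader != MAP_FAILED) {
        assert((const volatile char *)writer != reader);
        writer[0] = 'X';                                /* store... */
        int visible = (reader[0] == 'X');               /* ...visible at once */
        int flushed = (msync(writer, LEN, MS_SYNC) == 0); /* persistent now */
        rc = (visible && flushed) ? 0 : -1;
    }
    if (writer != MAP_FAILED)
        munmap(writer, LEN);
    if (reader != MAP_FAILED)
        munmap((void *)reader, LEN);
    return rc;
}
```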



#### HOW THE HW WORKS

MOV













### **CREATING A PROGRAMMING ENVIRONMENT**





# **OPERATING SYSTEM ESSENTIALS**

## **ENABLING IN THE ECOSYSTEM**

- Linux kernel version 4.19 (ext4, xfs)
- Windows Server 2019 (NTFS)
- VMware vSphere 6.7
- RHEL 7.5
- SLES 15 and SLES 12 SP4
- Ubuntu 18.\*
- Java JDK 12
- Kubernetes 1.13
- OpenStack 'Stein'

See Steve Scargall's Webinar on how to provision Optane DC Persistent Memory: https://software.intel.com/en-us/videos/provisioning-intel-optane-dc-persistent-memory-modules-in-linux



# **PROGRAMMING WITH OPTIMIZED FLUSH**

- Use Standard Flush unless the OS says it is safe to use Optimized Flush
- On Windows
  - When you successfully memory map a DAX file:
    - Optimized Flush is safe
- On Linux
  - When you successfully memory map a DAX file with MAP\_SYNC:
    - Optimized Flush is safe
  - MAP\_SYNC flag to mmap() is new
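The Linux rule above could be sketched as follows. This is an illustration under assumptions: `struct mapping`, `map_for_pmem` and `flush_mapping` are hypothetical names, and the code relies on `MAP_SYNC` being requested together with `MAP_SHARED_VALIDATE`, as the Linux `mmap(2)` manual page describes. If the headers or the file system do not support `MAP_SYNC`, it falls back to a plain shared mapping and reports that only the standard flush is safe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes MAP_SYNC on recent glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct mapping {
    void *base;                 /* NULL if mapping failed */
    size_t len;
    bool optimized_flush;       /* true only if the MAP_SYNC mapping succeeded */
};

/* Map `len` bytes of `path`, preferring MAP_SYNC. Sets the file size to `len`. */
struct mapping map_for_pmem(const char *path, size_t len)
{
    struct mapping m = { NULL, len, false };
    assert(len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return m;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return m;
    }

    void *base = MAP_FAILED;
#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (base != MAP_FAILED)
        m.optimized_flush = true;
#endif
    if (base == MAP_FAILED)     /* not a DAX file, or MAP_SYNC unavailable */
        base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    if (base != MAP_FAILED)
        m.base = base;
    return m;
}

/*
 * Standard flush. It is correct for both kinds of mapping; when
 * optimized_flush is true, a user-space cache flush could be used instead.
 */
int flush_mapping(const struct mapping *m)
{
    return msync(m->base, m->len, MS_SYNC);
}
```

On a file that is not on a DAX file system, `optimized_flush` comes back false and the caller stays on `msync()`.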



# THE PMDK LIBRARIES

#### **PMDK LIBRARIES**

Language bindings

High Level Interfaces

http://pmem.io

https://github.com/pmem/pmdk





### **DIFFERENT WAYS TO USE PERSISTENT MEMORY**










## MEMORY MODE

#### When To Use

- modifying applications is not feasible
- massive amounts of memory are required (more TB)
- CPU utilization is low in a shared environment (more VMs)

Not really a part of PMDK, but it's the easiest way to take advantage of Persistent Memory.

```
char *memory = malloc(sizeof(struct my_object));
strcpy(memory, "Hello World");
```

#### Memory is automatically placed in PMEM, with caching in DRAM








## LIBMEMKIND

#### When To Use

- application can be modified
- different tiers of objects (hot, warm) can be identified
- persistence is not required

Explicitly manage allocations from App Direct, allowing for fine-grained control of DRAM/PMEM

```
#include <memkind.h>

/* PMEM_DIR: directory on a pmem-aware file system, defined elsewhere */
struct memkind *pmem_kind = NULL;
size_t max_size = 1 << 30; /* gigabyte */

/* Create PMEM partition with specific size */
memkind_create_pmem(PMEM_DIR, max_size, &pmem_kind);

/* allocate 512 bytes from 1 GB available */
char *pmem_string = (char *)memkind_malloc(pmem_kind, 512);

/* deallocate the pmem object */
memkind_free(pmem_kind, pmem_string);
```

#### The application can decide what type of memory to use for objects









## LIBVMEMCACHE

#### When To Use

- caching large quantities of data
- low latency of operations is needed
- persistence is not required

Seamless and easy-to-use LRU caching solution for persistent memory. Keys reside in DRAM, values reside in PMEM.

```
VMEMcache *cache = vmemcache_new();
vmemcache_add(cache, "/tmp");
const char *key = "foo";
vmemcache_put(cache, key, strlen(key), "bar", sizeof("bar"));
char buf[128];
ssize_t len = vmemcache_get(cache, key, strlen(key),
    buf, sizeof(buf), 0, NULL);
vmemcache_delete(cache);
```

### Designed for easy integration with existing systems







## LIBPMEMKV

#### When To Use

- storing large quantities of data
- low latency of operations is needed
- persistence is required

Local/embedded key-value datastore optimized for persistent memory. Provides different language bindings and storage engines.

```
// add the given key-value pair
if (kv->put(argv[2], argv[3]) != status::OK) {
    cerr << db::errormsg() << endl;
    exit(1);
}

// lookup the given key and print the value
auto ret = kv->get(argv[2], [&](string_view value) {
    cout << argv[2] << "=\"" << value.data() << "\"" << endl;
});
if (ret != status::OK) {
    cerr << db::errormsg() << endl;
    exit(1);
}
```





## LIBPMEMOBJ

#### When To Use

- direct byte-level access to objects is needed
- using custom storage-layer algorithms
- persistence is required

Transactional object store, providing memory allocation, transactions, and general facilities for persistent memory programming.

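```
#include <libpmemobj.h>

typedef struct foo {
    PMEMoid bar; // persistent pointer
    int value;
} foo;
TOID_DECLARE_ROOT(foo); // lets TOID(foo) name the root object type

int main() {
    PMEMobjpool *pop = pmemobj_open (...);
    TX_BEGIN(pop) {
        TOID(foo) root = POBJ_ROOT(pop, foo);
        TX_ADD(root); // snapshot the object so an abort can roll it back
        D_RW(root)->value = 5;
    } TX_END;
}
```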

### Flexible and relatively easy way to leverage PMEM







## LIBPMEM

#### When To Use

- modifying an application that already uses memory-mapped I/O
- other libraries are too high-level
- only need low-level PMEM-optimized primitives (memcpy etc.)

Low-level library that provides basic primitives needed for persistent memory programming and optimized memcpy/memmove/memset

### The very basics needed for PMEM programming







# **PROGRAMMING MODEL TOOLS**



# C PROGRAMMING WITH LIBPMEMOBJ




### **TRANSACTION SYNTAX**

```
TX_BEGIN(pop) {
    /* the actual transaction code goes here... */
} TX_ONCOMMIT {
    /*
     * optional - executed only if the above block
     * successfully completes
     */
} TX_ONABORT {
    /*
     * optional - executed if starting the transaction fails
     * or if transaction is aborted by an error or a call to
     * pmemobj_tx_abort()
     */
} TX_FINALLY {
    /*
     * optional - if exists, it is executed after
     * TX_ONCOMMIT or TX_ONABORT block
     */
} TX_END /* mandatory */
```
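To give a feel for what commit and abort mean for the data, here is a toy undo log in plain C. It is only an illustration, not libpmemobj's implementation: the log lives in ordinary memory, so it would not survive a crash, and all names (`struct tx`, `tx_begin`, `tx_add`, `tx_commit`, `tx_abort`, `TX_LOG_MAX`, `TX_SNAP_MAX`) are hypothetical. `tx_add` plays the role of snapshotting a range before it is modified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TX_LOG_MAX  8           /* hypothetical limits for the sketch */
#define TX_SNAP_MAX 64

struct tx_entry {
    void *addr;                 /* where the snapshot was taken from */
    size_t len;
    unsigned char old[TX_SNAP_MAX];
};

struct tx {
    struct tx_entry log[TX_LOG_MAX];
    size_t n;                   /* number of snapshots recorded */
};

void tx_begin(struct tx *tx)
{
    tx->n = 0;
}

/* Snapshot a range before modifying it. Returns 0, or -1 if the log is full. */
int tx_add(struct tx *tx, void *addr, size_t len)
{
    assert(addr != NULL);
    if (tx->n == TX_LOG_MAX || len > TX_SNAP_MAX)
        return -1;
    struct tx_entry *e = &tx->log[tx->n++];
    e->addr = addr;
    e->len = len;
    memcpy(e->old, addr, len);
    return 0;
}

/* Commit: the new values stay and the snapshots are discarded. */
void tx_commit(struct tx *tx)
{
    tx->n = 0;
}

/* Abort: restore snapshots newest-first, so the oldest value wins. */
void tx_abort(struct tx *tx)
{
    while (tx->n > 0) {
        struct tx_entry *e = &tx->log[--tx->n];
        memcpy(e->addr, e->old, e->len);
    }
}
```

A persistent version would have to keep the log itself in pmem and flush each snapshot before the corresponding data is changed, so that recovery after a crash can roll back an interrupted transaction.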







# **PERSISTENT MEMORY LOCKS**

- Want locks to live near the data they protect (i.e. inside structs)
- Does the state of locks get stored persistently?
  - Would have to flush to persistence when used
  - Would have to recover locked locks on start-up
    - Might be a different program accessing the file
  - Would run at pmem speeds
- PMEMmutex
  - Runs at DRAM speeds
  - Automatically initialized on pool open
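One way to get a lock that lives in a persistent struct yet never needs recovery is to tag it with an identifier of the current run and re-initialize it when the tag is stale. The sketch below shows that idea with plain pthreads. It is an assumption-laden illustration, not the PMEMmutex implementation, and it initializes lazily on first use rather than at pool open. `struct plock`, `plock_get` and `demo_plock` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Lives inside a persistent struct; its bytes survive restarts but go stale. */
struct plock {
    uint64_t runid;             /* run that last initialized the mutex */
    pthread_mutex_t mutex;      /* volatile state, valid only for that run */
};

/*
 * Return a usable mutex, re-initializing it if it was left by an earlier run.
 * Not thread-safe as written: a real version must make check-and-init atomic.
 */
pthread_mutex_t *plock_get(struct plock *l, uint64_t current_runid)
{
    assert(current_runid != 0); /* 0 is reserved for "never initialized" */
    if (l->runid != current_runid) {
        if (pthread_mutex_init(&l->mutex, NULL) != 0)
            return NULL;
        l->runid = current_runid;
    }
    return &l->mutex;
}

/* Simulate reopening a pool: stale lock bytes from run 1 are usable in run 2. */
int demo_plock(void)
{
    struct plock l;
    memset(&l, 0xAB, sizeof(l));    /* garbage, as if a crash left it locked */
    l.runid = 1;

    pthread_mutex_t *m = plock_get(&l, 2);
    if (m == NULL || pthread_mutex_lock(m) != 0)
        return -1;
    pthread_mutex_unlock(m);

    int ok = (l.runid == 2);
    pthread_mutex_destroy(m);
    return ok ? 0 : -1;
}
```

Nothing about the lock state is ever flushed, so lock and unlock run at DRAM speeds, and a lock that was held at the time of a crash is simply reset on the next run.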



## C++ PROGRAMMING WITH LIBPMEMOBJ



## **C++ QUEUE EXAMPLE: DECLARATIONS**

```
/* entry in the queue */
struct pmem_entry {
    persistent_ptr<pmem_entry> next;
    p<uint64_t> value;
};
```

| Type                | Meaning                                                                                                                      |
|---------------------|------------------------------------------------------------------------------------------------------------------------------|
| `persistent_ptr<T>` | Pointer is really a position-independent Object ID in pmem.<br>Gets rid of the need to use C macros like D_RW().             |
| `p<T>`              | Field is pmem-resident and needs to be maintained persistently.<br>Gets rid of the need to use C macros like TX_ADD().       |



## **C++ QUEUE EXAMPLE: TRANSACTION**

```
void push(pool_base &pop, uint64_t value) {
    transaction::run(pop, [&] {
        auto n = make_persistent<pmem_entry>();
        n->value = value;
        n->next = nullptr;
        if (head == nullptr) {
            head = tail = n;
        } else {
            tail->next = n;
            tail = n;
        }
    });
}
```

Transactional (including allocations & frees)





# LINKS TO MORE INFORMATION

Find the PMDK (Persistent Memory Development Kit) at http://pmem.io/pmdk/

Getting Started

- Intel IDZ persistent memory: https://software.intel.com/en-us/persistent-memory
- Entry into overall architecture: http://pmem.io/2014/08/27/crawl-walk-run.html
- Emulate persistent memory: http://pmem.io/2016/02/22/pm-emulation.html

Linux Resources

- Linux Community Pmem Wiki: https://nvdimm.wiki.kernel.org/
- Pmem enabling in SUSE Linux Enterprise 12 SP2: https://www.suse.com/communities/blog/nvdimm-enabling-suse-linux-enterprise-12-service-pack-2/

Windows Resources

- Using Byte-Addressable Storage in Windows Server 2016: https://channel9.msdn.com/Events/Build/2016/P470
- Accelerating SQL Server 2016 using Pmem: https://channel9.msdn.com/Shows/Data-Exposed/SQL-Server-2016-and-Windows-Server-2016-SCM--FAST

Other Resources

- SNIA Persistent Memory Summit 2018: https://www.snia.org/pm-summit
- Intel manageability tools for Pmem: https://01.org/ixpdimm-sw/



