NoSQL.

Overview

Teaching: 180 min

Questions

What is polyglot persistence and how solves the data storge problems of the present days?

What are the common treats of NoSQL solutions?

How can a data analyst choose the appropriate data solution tailored to her/his task?

Objectives

Understanding the polyglot persistence

Understanding eventual consistency

Understanding SQL VS NoSQL

Understanding NoSQL families

Introducing Key-Value Stores

Introducing Jupyter Notebook

Practice with Redis

Introducing Time Series DBMS

Introducing Document Stores

Practice of MongoDB with Airbnb data

Introducing Column Stores

Introducing Search Engines

Practice of Solr with Kaggle Flight data

Introducing Graph DBMS

Practice of Neo4J with Paradise Papers

Keywords

#BASE

#NOSQL FAMILIES

#JUPYTER NOTEBOOK

#REDIS

#MONGO

#SOLR

#NETWORK SCIENCE

#NEO4J

#PARADISE PAPERS

Table of Content

Lecture PPTX

Redis

MongoDB

Solr

Neo4j

Redis

Links to help you

https://redis-py.readthedocs.io/en/stable/index.html

https://redis.io/commands#

Exercise interface: http://de1.ceudsd.com (Jupyter Notebook)

Exercise notebook

Username/Password will be distributed during class. Ask on Teams if you haven’t got one.

Connect to Redis with Python

import redis
r = redis.Redis(host='localhost', port=8082)

Set a value with a key

r.set('laszlo_sallo', 'bar')

Get value by key

r.get('laszlo_sallo')

Overwrite with expiration

r.set('laszlo_sallo','efwefwefgwerg',ex=5)

Is it there?

r.get('laszlo_sallo')

Set a number

r.set('laszlo_sallo_nr', 10)
r.get('laszlo_sallo_nr')

Increase number value

r.incr("laszlo_sallo_nr")
r.get('laszlo_sallo_nr')

Store multiple key-value

r.mset({'laszlo_sallo_one': 1, 'laszlo_sallo_two': 2, 'laszlo_sallo_three': 3})

Display all keys stored in DB

r.keys()

Retrieve multiple values by key

r.mget('laszlo_sallo_one','laszlo_sallo_three')

REDIS Exercise

USING THE DOCUMENTATION, FIND HOW TO DELETE A VALUE BY KEY AND HOW TO CHECK THE EXISTENCE OF A KEY.

MongoDB

Links to help you

https://docs.mongodb.com/manual/

https://www.w3schools.com/python/python_mongodb_getstarted.asp

AirBnb dataset

Connect to MongoDB with Python

import pymongo
import pprint
mongo = pymongo.MongoClient("mongodb://localhost:27017")

Select a database

If database does not exist, will be created with the first data write (eg. insert line)

db = mongo["mydatabase"]

Create a collection

customers = db["laszlo_sallo"]

Insert a document

id = customers.insert_one({ "name": "John", "address": "Boston Highway 37" }).inserted_id

Find

Find the customer inserted by id. Pretty print the result.

pprint.pprint(customers.find_one({"_id": id}))

Find multiple customers by “name” field. Inverse sort by “address”. Limit to 5. In order to print the result we iterating over the result set and pretty print each resulting JSON.

for customer in customers.find({"name": "John"}).sort("address",-1).limit(5):
    pprint.pprint(customer)

Count

customers.count_documents({"name": "John"})

Distinct

Insert a row with another John into the new collection and then find the customers called “John” which address starts with “Bos” and print out distinct addresses.

id = customers.insert_one({ "name": "John", "address": "Boole 01" }).inserted_id

for customer in customers.find({"name":"John","address": {"$regex": "^Boston"}}).distinct("address"):
    pprint.pprint(customer)

List collections stored in the database

db.list_collection_names()

Airbnb Sample database

airbnb = db["airbnb"]
pprint.pprint(airbnb.find_one())

Advance filtering

Filter by a deeper JSON field. Print only part of JSON.

for listing in airbnb.find({ "address.country": "Spain" }).limit(10):
    pprint.pprint(listing['address']['government_area'])

MONGO Exercise

COUNT HOW MANY AIRBNB LISTINGS WE HAVE IN THE SAMPLE DATABASE HAVING “COUNTRY_CODE” “US” OR “ADDRESS.MARKET” STARTWITH “M” (USE MONGODB DOCUMENTATION)
Solution

Just a hint, the solution in SQL is something like:
SELECT COUNT(*) FROM airbnb WHERE country_code='US' OR market LIKE 'M%'

Solr

Links to help you

https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

http://yonik.com/solr/query-syntax/

Exercise interface: http://solr.ceudsd.com/solr/#/flightdelays/query

A simple query

SOLR has different connectors to programming languages. For simple query testing, we don’t need to program because SOLR is offering so called HTTP Rest interface. These are basically url calls from a browser.

The simplest query (the result is limited by default to 10):

http://solr.ceudsd.com/solr/flightdelays/select?q=*:*

In SQL, this would be something like:

SELECT * FROM flightdelays LIMIT 10;

Ranges

List records from the last 10 years where tail number is N520JB:

http://solr.ceudsd.com/solr/flightdelays/select?fl=DISTANCE,ORIG_CITY,DEST_CITY&q=TAIL_NUMBER:N838UA AND DATE:[NOW-10YEARS TO *]&sort=DISTANCE desc&rows=5&wt=python

In SQL, this would be something like:

SELECT distance,orig_city,dest_city FROM flightdelays 
  WHERE tail_number='N520JB' AND date >= DATE_SUB(NOW(),INTERVAL 10 YEAR) 
  ORDER BY distance DESC 
  LIMIT 5;

String search / Fuzzy search

List records where tail numbers starting with any character, followed by “2”, followed by 2 any character, followed by “jb”. Display only tail number in the result set:

http://solr.ceudsd.com/solr/flightdelays/select?fl=TAIL_NUMBER&q=TAIL_NUMBER:?8???a&wt=python

Fuzzy searches is based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term

In the next example we list records with destination city close to “columbas” by distance of 2. The distance referred to here is the number of term movements needed to match the specified phrase.

http://solr.ceudsd.com/solr/flightdelays/select?fl=DEST_CITY&q=DEST_CITY:kolumbas~2&wt=python

Same as before, but this time return back distinct destination cities as well:

http://solr.ceudsd.com/solr/flightdelays/select?q=DEST_CITY:Boise~3&facet.field=DEST_CITY_str&facet=on&rows=0&wt=python

This previous result sounds like a combined result of the following SQLs:

SELECT * FROM flightdelays WHERE DEST_CITY LIKE 'columbas%' LIMIT 10;

SELECT dest_city, COUNT(*) FROM flightdelays 
  WHERE DEST_CITY LIKE 'columbas%' 
  GROUP BY dest_city;

Geo spacial search

Return back records within a circle defined by center point of 39.85,-104.66 [lat,lon] and diameter of 2 kilometer. Display only ORIG_CITY and ORIG_LOCATION_p in the result set and facests for ORIG_CITY_str.

http://solr.ceudsd.com/solr/flightdelays/select?d=2&facet.field=ORIG_CITY_str&facet=on&fl=ORIG_CITY,ORIG_LOCATION_p,&fq={!geofilt}&pt=39.85,-104.66&q=*:*&sfield=ORIG_LOCATION_p&wt=python

SOLR Exercise

HOW MANY FLIGHTS ARRIVED IN SAN FRANCISCO WITH NO DELAY ALTHOUGH THEY DEPARTED AT LEAST 50 MINS BEHIND THE SCHEDULE?

Connect to SOLR with Python

import pysolr
from requests.auth import HTTPBasicAuth
import pprint

solr = pysolr.Solr('http://localhost:8081/solr/flightdelays',auth=HTTPBasicAuth('xxx','xxx'))

Query the first 10 records

results = solr.search('*:*')
for result in results:
    pprint.pprint(result)

A more complicated query

List records from the last 10 years where tail number is N520JB:

results = solr.search('TAIL_NUMBER:N838UA AND DATE:[NOW-10YEARS TO *]',fl="DISTANCE,ORIG_CITY,DEST_CITY",sort="DISTANCE desc",rows=10)
for result in results:
    pprint.pp(result,width=50)

Neo4j

Links to help you

https://neo4j.com/developer/cypher-query-language/

http://neo4j.com/docs/developer-manual/current/cypher/

https://neo4j.com/blog/analyzing-paradise-papers-neo4j/

Exercise interface: https://sandbox.neo4j.com/

Exercise notebook

Simple queries

In Neo4J the SELECT is called MATCH. One of the simplest query is selecting 25 Officer nodes :

MATCH (n:Entity) 
RETURN n.name LIMIT 25

In SQL, this would be something like:

SELECT n.name FROM Entity AS n LIMIT 25;

We can use WHERE clause to filter our result:

MATCH (o:Officer)
WHERE o.countries CONTAINS 'Hungary'
RETURN o

In SQL, this would be something like:

SELECT o.countries FROM officer AS o WHERE o.countries LIKE '%Hungary%';

NEO4J Exercise 1

RETURN THE FIRST 10 ADDRESS NODES

NEO4J Exercise 2

HOW MANY PROPERTIES AN ADDRESS NODE HAS?

NEO4J Exercise 3

RETURN THE FIRST 10 COUNTRIES OF THE ADDRESS NODE. WHAT IS THE LAST COUNTRY IN THE LIST?

NEO4J Exercise4

HOW MANY ADDRESS NODES HAS ‘Mexico’ AND ‘Monaco’ IN THEIR ADDRESS PROPERTY?

Joins

Find the Officers and the Entities linked to them (double MATCH)

MATCH (o:Officer) 
MATCH p=(o)-[r]-(c:Entity)
RETURN p
LIMIT 50

In SQL, this would be something like:

SELECT * 
FROM officer as o  
INNER JOIN Entity as c 
USING (relationship)

Find the nodes associated “the duchy of lancaster”:

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "the duchy of lancaster"
MATCH p=(o)-[r]-(c)
RETURN p

Same, but this time return the nodes 2 hops away:

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "the duchy of lancaster"
MATCH p=(o)-[*..2]-(c)
RETURN p

Count

Which country has the most addresses

MATCH (n:Address) 
RETURN n.countries, count(*)
ORDER BY count(*) DESC
LIMIT 10

NEO4J Exercise 5

LIST THE NAME AND NUMBER CONNECTIONS OF THE TOP 10 MOST CONNECTED OFFICERS FROM BULGARIA.WHO IS THE NO1?

Connect to neo4j with Python

from neo4j import GraphDatabase, basic_auth

driver = GraphDatabase.driver("neo4j://18.208.109.82:7687", auth=basic_auth("neo4j", "overlays-consoles-regulations"))

Query and display

cypher_query = '''
MATCH (o:Officer)
WHERE o.countries CONTAINS $country
RETURN o.name as name
LIMIT 10
'''

with driver.session(database="neo4j") as session:
  results = session.execute_read(lambda tx: tx.run(cypher_query,country="Hungary").data())
    
  for record in results:
    print(record['name'])

Close connection

driver.close()

NoSQL.

Overview

Keywords

Table of Content

Redis

Links to help you

Exercise interface: http://de1.ceudsd.com (Jupyter Notebook)

Connect to Redis with Python

Set a value with a key

Get value by key

Overwrite with expiration

Is it there?

Set a number

Increase number value

Store multiple key-value

Display all keys stored in DB

Retrieve multiple values by key

REDIS Exercise

MongoDB

Links to help you

Connect to MongoDB with Python

Select a database

Create a collection

Insert a document

Find

Count

Distinct

List collections stored in the database

Airbnb Sample database

Advance filtering

MONGO Exercise

Solution

Solr

Links to help you

Exercise interface: http://solr.ceudsd.com/solr/#/flightdelays/query

A simple query

Ranges

String search / Fuzzy search

Facets

Geo spacial search

SOLR Exercise

Connect to SOLR with Python

Query the first 10 records

A more complicated query

Neo4j

Links to help you

Exercise interface: https://sandbox.neo4j.com/

Simple queries

NEO4J Exercise 1

NEO4J Exercise 2

NEO4J Exercise 3

NEO4J Exercise4

Joins

Count

NEO4J Exercise 5

Connect to neo4j with Python

Query and display

Close connection