NoSQL.

Overview

Teaching: 180 min
Questions
  • What is polyglot persistence and how solves the data storge problems of the present days?

  • What are the common treats of NoSQL solutions?

  • How can a data analyst choose the appropriate data solution tailored to her/his task?


Objectives
  • Understanding the polyglot persistence

  • Understanding eventual consistency

  • Understanding SQL VS NoSQL

  • Understanding NoSQL families

  • Introducing Key-Value Stores

  • Introducing Jupyter Notebook

  • Practice with Redis

  • Introducing Time Series DBMS

  • Introducing Document Stores

  • Practice of MongoDB with Airbnb data

  • Introducing Column Stores

  • Introducing Search Engines

  • Practice of Solr with Kaggle Flight data

  • Introducing Graph DBMS

  • Practice of Neo4J with Paradise Papers

Keywords

#BASE

#NOSQL FAMILIES

#JUPYTER NOTEBOOK

#REDIS

#MONGO

#SOLR

#NETWORK SCIENCE

#NEO4J

#PARADISE PAPERS

Table of Content

Lecture PPTX

Redis

MongoDB

Solr

Neo4j



Redis

https://redis-py.readthedocs.io/en/stable/index.html

https://redis.io/commands#


Exercise interface: http://de1.ceudsd.com (Jupyter Notebook)

Exercise notebook

Username/Password will be distributed during class. Ask on Teams if you haven’t got one.



Connect to Redis with Python

import redis
r = redis.Redis(host='localhost', port=8082)

Set a value with a key

r.set('laszlo_sallo', 'bar')

Get value by key

r.get('laszlo_sallo')

Overwrite with expiration

r.set('laszlo_sallo','efwefwefgwerg',ex=5)

Is it there?

r.get('laszlo_sallo')

Set a number

r.set('laszlo_sallo_nr', 10)
r.get('laszlo_sallo_nr')

Increase number value

r.incr("laszlo_sallo_nr")
r.get('laszlo_sallo_nr')

Store multiple key-value

r.mset({'laszlo_sallo_one': 1, 'laszlo_sallo_two': 2, 'laszlo_sallo_three': 3})

Display all keys stored in DB

r.keys()

Retrieve multiple values by key

r.mget('laszlo_sallo_one','laszlo_sallo_three')


REDIS Exercise

USING THE DOCUMENTATION, FIND HOW TO DELETE A VALUE BY KEY AND HOW TO CHECK THE EXISTENCE OF A KEY.




MongoDB

https://docs.mongodb.com/manual/

https://www.w3schools.com/python/python_mongodb_getstarted.asp

AirBnb dataset


Exercise notebook


Connect to MongoDB with Python

import pymongo
import pprint
mongo = pymongo.MongoClient("mongodb://localhost:27017")


Select a database

If database does not exist, will be created with the first data write (eg. insert line)

db = mongo["mydatabase"]


Create a collection

customers = db["laszlo_sallo"]


Insert a document

id = customers.insert_one({ "name": "John", "address": "Boston Highway 37" }).inserted_id


Find

Find the customer inserted by id. Pretty print the result.

pprint.pprint(customers.find_one({"_id": id}))

Find multiple customers by “name” field. Inverse sort by “address”. Limit to 5. In order to print the result we iterating over the result set and pretty print each resulting JSON.

for customer in customers.find({"name": "John"}).sort("address",-1).limit(5):
    pprint.pprint(customer)


Count

customers.count_documents({"name": "John"})


Distinct

Insert a row with another John into the new collection and then find the customers called “John” which address starts with “Bos” and print out distinct addresses.

id = customers.insert_one({ "name": "John", "address": "Boole 01" }).inserted_id

for customer in customers.find({"name":"John","address": {"$regex": "^Boston"}}).distinct("address"):
    pprint.pprint(customer)


List collections stored in the database

db.list_collection_names()


Airbnb Sample database

airbnb = db["airbnb"]
pprint.pprint(airbnb.find_one())


Advance filtering

Filter by a deeper JSON field. Print only part of JSON.

for listing in airbnb.find({ "address.country": "Spain" }).limit(10):
    pprint.pprint(listing['address']['government_area'])


MONGO Exercise

COUNT HOW MANY AIRBNB LISTINGS WE HAVE IN THE SAMPLE DATABASE HAVING “COUNTRY_CODE” “US” OR “ADDRESS.MARKET” STARTWITH “M” (USE MONGODB DOCUMENTATION)

Solution

Just a hint, the solution in SQL is something like:

SELECT COUNT(*) FROM airbnb WHERE country_code='US' OR market LIKE 'M%'




Solr

https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

http://yonik.com/solr/query-syntax/


Exercise interface: http://solr.ceudsd.com/solr/#/flightdelays/query

Exercise notebook



A simple query

SOLR has different connectors to programming languages. For simple query testing, we don’t need to program because SOLR is offering so called HTTP Rest interface. These are basically url calls from a browser.

The simplest query (the result is limited by default to 10):

http://solr.ceudsd.com/solr/flightdelays/select?q=*:*


In SQL, this would be something like:

SELECT * FROM flightdelays LIMIT 10;


Ranges

List records from the last 10 years where tail number is N520JB:

http://solr.ceudsd.com/solr/flightdelays/select?fl=DISTANCE,ORIG_CITY,DEST_CITY&q=TAIL_NUMBER:N838UA AND DATE:[NOW-10YEARS TO *]&sort=DISTANCE desc&rows=5


In SQL, this would be something like:

SELECT distance,orig_city,dest_city FROM flightdelays 
  WHERE tail_number='N520JB' AND date >= DATE_SUB(NOW(),INTERVAL 10 YEAR) 
  ORDER BY distance DESC 
  LIMIT 5;



List records where tail numbers starting with any character, followed by “2”, followed by 2 any character, followed by “jb”. Display only tail number in the result set:

http://solr.ceudsd.com/solr/flightdelays/select?fl=TAIL_NUMBER&q=TAIL_NUMBER:?8???a

Fuzzy searches is based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term

In the next example we list records with destination city close to “columbas” by distance of 2. The distance referred to here is the number of term movements needed to match the specified phrase.

http://solr.ceudsd.com/solr/flightdelays/select?fl=DEST_CITY&q=DEST_CITY:kolumbas~2


Facets

Same as before, but this time return back distinct destination cities as well:

http://solr.ceudsd.com/solr/flightdelays/select?q=DEST_CITY:Boise~3&facet.field=DEST_CITY_str&facet=on&rows=0


This previous result sounds like a combined result of the following SQLs:

SELECT * FROM flightdelays WHERE DEST_CITY LIKE 'columbas%' LIMIT 10;

SELECT dest_city, COUNT(*) FROM flightdelays 
  WHERE DEST_CITY LIKE 'columbas%' 
  GROUP BY dest_city;


Return back records within a circle defined by center point of 39.85,-104.66 [lat,lon] and diameter of 2 kilometer. Display only ORIG_CITY and ORIG_LOCATION_p in the result set and facests for ORIG_CITY_str.

http://solr.ceudsd.com/solr/flightdelays/select?d=2&facet.field=ORIG_CITY_str&facet=on&fl=ORIG_CITY,ORIG_LOCATION_p,&fq={!geofilt}&pt=39.85,-104.66&q=*:*&sfield=ORIG_LOCATION_p


SOLR Exercise

HOW MANY FLIGHTS ARRIVED IN SAN FRANCISCO WITH NO DELAY ALTHOUGH THEY DEPARTED AT LEAST 50 MINS BEHIND THE SCHEDULE?



Connect to SOLR with Python

import pysolr
from requests.auth import HTTPBasicAuth
import pprint

solr = pysolr.Solr('http://localhost:8081/solr/flightdelays',auth=HTTPBasicAuth('xxx','xxx'))


Query the first 10 records

results = solr.search('*:*')
for result in results:
    pprint.pprint(result)


A more complicated query

List records from the last 10 years where tail number is N520JB:

results = solr.search('TAIL_NUMBER:N838UA AND DATE:[NOW-10YEARS TO *]',fl="DISTANCE,ORIG_CITY,DEST_CITY",sort="DISTANCE desc",rows=10)
for result in results:
    pprint.pp(result,width=50)




Neo4j

https://neo4j.com/developer/cypher-query-language/

http://neo4j.com/docs/developer-manual/current/cypher/

https://neo4j.com/blog/analyzing-paradise-papers-neo4j/


Exercise interface: https://sandbox.neo4j.com/


Register and start a sandbox with “Paradise Papers by ICIJ”

Exercise notebook



Simple queries

In Neo4J the SELECT is called MATCH. One of the simplest query is selecting 25 Officer nodes :

MATCH (n:Entity) 
RETURN n.name LIMIT 25


In SQL, this would be something like:

SELECT n.name FROM Entity AS n LIMIT 25;

We can use WHERE clause to filter our result:

MATCH (o:Officer)
WHERE o.countries CONTAINS 'Hungary'
RETURN o


In SQL, this would be something like:

SELECT o.countries FROM officer AS o WHERE o.countries LIKE '%Hungary%';


NEO4J Exercise 1

RETURN THE FIRST 10 ADDRESS NODES

NEO4J Exercise 2

HOW MANY PROPERTIES AN ADDRESS NODE HAS?

NEO4J Exercise 3

RETURN THE FIRST 10 COUNTRIES OF THE ADDRESS NODE. WHAT IS THE LAST COUNTRY IN THE LIST?

NEO4J Exercise4

HOW MANY ADDRESS NODES HAS ‘Mexico’ AND ‘Monaco’ IN THEIR ADDRESS PROPERTY?


Joins

Find the Officers and the Entities linked to them (double MATCH)

MATCH (o:Officer) 
MATCH p=(o)-[r]-(c:Entity)
RETURN p
LIMIT 50


In SQL, this would be something like:

SELECT * 
FROM officer as o  
INNER JOIN Entity as c 
USING (relationship)

Find the nodes associated “the duchy of lancaster”:

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "the duchy of lancaster"
MATCH p=(o)-[r]-(c)
RETURN p

Same, but this time return the nodes 2 hops away:

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "the duchy of lancaster"
MATCH p=(o)-[*..2]-(c)
RETURN p


Count

Which country has the most addresses

MATCH (n:Address) 
RETURN n.countries, count(*)
ORDER BY count(*) DESC
LIMIT 10


NEO4J Exercise 5

LIST THE NAME AND NUMBER CONNECTIONS OF THE TOP 10 MOST CONNECTED OFFICERS FROM BULGARIA.WHO IS THE NO1?



Connect to neo4j with Python

from neo4j import GraphDatabase, basic_auth

driver = GraphDatabase.driver("neo4j://18.208.109.82:7687", auth=basic_auth("neo4j", "overlays-consoles-regulations"))


Query and display

cypher_query = '''
MATCH (o:Officer)
WHERE o.countries CONTAINS $country
RETURN o.name as name
LIMIT 10
'''

with driver.session(database="neo4j") as session:
  results = session.execute_read(lambda tx: tx.run(cypher_query,country="Hungary").data())
    
  for record in results:
    print(record['name'])


Close connection

driver.close()