NoSQL.
Overview
Teaching: 180 minQuestions
What is polyglot persistence and how solves the data storge problems of the present days?
What are the common treats of NoSQL solutions?
How can a data analyst choose the appropriate data solution tailored to her/his task?
Objectives
Understanding the polyglot persistence
Understanding eventual consistency
Understanding SQL VS NoSQL
Understanding NoSQL families
Introducing Key-Value Stores
Introducing Jupyter Notebook
Practice with Redis
Introducing Time Series DBMS
Introducing Document Stores
Practice of MongoDB with Airbnb data
Introducing Column Stores
Introducing Search Engines
Practice of Solr with Kaggle Flight data
Introducing Graph DBMS
Practice of Neo4J with Paradise Papers
Keywords
#BASE
#NOSQL FAMILIES
#JUPYTER NOTEBOOK
#REDIS
#MONGO
#SOLR
#NETWORK SCIENCE
#NEO4J
#PARADISE PAPERS
Table of Content
Redis
Links to help you
https://redis-py.readthedocs.io/en/stable/index.html
Exercise interface: http://de1.ceudsd.com (Jupyter Notebook)
Exercise notebook
Username/Password will be distributed during class. Ask on Teams if you haven’t got one.
Connect to Redis with Python
import redis
r = redis.Redis(host='localhost', port=8082)
Set a value with a key
r.set('laszlo_sallo', 'bar')
Get value by key
r.get('laszlo_sallo')
Overwrite with expiration
r.set('laszlo_sallo','efwefwefgwerg',ex=5)
Is it there?
r.get('laszlo_sallo')
Set a number
r.set('laszlo_sallo_nr', 10)
r.get('laszlo_sallo_nr')
Increase number value
r.incr("laszlo_sallo_nr")
r.get('laszlo_sallo_nr')
Store multiple key-value
r.mset({'laszlo_sallo_one': 1, 'laszlo_sallo_two': 2, 'laszlo_sallo_three': 3})
Display all keys stored in DB
r.keys()
Retrieve multiple values by key
r.mget('laszlo_sallo_one','laszlo_sallo_three')
REDIS Exercise
USING THE DOCUMENTATION, FIND HOW TO DELETE A VALUE BY KEY AND HOW TO CHECK THE EXISTENCE OF A KEY.
MongoDB
Links to help you
https://docs.mongodb.com/manual/
https://www.w3schools.com/python/python_mongodb_getstarted.asp
Exercise notebook
Connect to MongoDB with Python
import pymongo
import pprint
mongo = pymongo.MongoClient("mongodb://localhost:27017")
Select a database
If database does not exist, will be created with the first data write (eg. insert line)
db = mongo["mydatabase"]
Create a collection
customers = db["laszlo_sallo"]
Insert a document
id = customers.insert_one({ "name": "John", "address": "Boston Highway 37" }).inserted_id
Find
Find the customer inserted by id. Pretty print the result.
pprint.pprint(customers.find_one({"_id": id}))
Find multiple customers by “name” field. Inverse sort by “address”. Limit to 5. In order to print the result we iterating over the result set and pretty print each resulting JSON.
for customer in customers.find({"name": "John"}).sort("address",-1).limit(5):
pprint.pprint(customer)
Count
customers.count_documents({"name": "John"})
Distinct
Insert a row with another John into the new collection and then find the customers called “John” which address starts with “Bos” and print out distinct addresses.
id = customers.insert_one({ "name": "John", "address": "Boole 01" }).inserted_id
for customer in customers.find({"name":"John","address": {"$regex": "^Boston"}}).distinct("address"):
pprint.pprint(customer)
List collections stored in the database
db.list_collection_names()
Airbnb Sample database
airbnb = db["airbnb"]
pprint.pprint(airbnb.find_one())
Advance filtering
Filter by a deeper JSON field. Print only part of JSON.
for listing in airbnb.find({ "address.country": "Spain" }).limit(10):
pprint.pprint(listing['address']['government_area'])
MONGO Exercise
COUNT HOW MANY AIRBNB LISTINGS WE HAVE IN THE SAMPLE DATABASE HAVING “COUNTRY_CODE” “US” OR “ADDRESS.MARKET” STARTWITH “M” (USE MONGODB DOCUMENTATION)
Solution
Just a hint, the solution in SQL is something like:
SELECT COUNT(*) FROM airbnb WHERE country_code='US' OR market LIKE 'M%'
Solr
Links to help you
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
http://yonik.com/solr/query-syntax/
Exercise interface: http://solr.ceudsd.com/solr/#/flightdelays/query
Exercise notebook
A simple query
SOLR has different connectors to programming languages. For simple query testing, we don’t need to program because SOLR is offering so called HTTP Rest interface. These are basically url calls from a browser.
The simplest query (the result is limited by default to 10):
http://solr.ceudsd.com/solr/flightdelays/select?q=*:*
In SQL, this would be something like:
SELECT * FROM flightdelays LIMIT 10;
Ranges
List records from the last 10 years where tail number is N520JB:
http://solr.ceudsd.com/solr/flightdelays/select?fl=DISTANCE,ORIG_CITY,DEST_CITY&q=TAIL_NUMBER:N838UA AND DATE:[NOW-10YEARS TO *]&sort=DISTANCE desc&rows=5&wt=python
In SQL, this would be something like:
SELECT distance,orig_city,dest_city FROM flightdelays
WHERE tail_number='N520JB' AND date >= DATE_SUB(NOW(),INTERVAL 10 YEAR)
ORDER BY distance DESC
LIMIT 5;
String search / Fuzzy search
List records where tail numbers starting with any character, followed by “2”, followed by 2 any character, followed by “jb”. Display only tail number in the result set:
http://solr.ceudsd.com/solr/flightdelays/select?fl=TAIL_NUMBER&q=TAIL_NUMBER:?8???a&wt=python
Fuzzy searches is based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term
In the next example we list records with destination city close to “columbas” by distance of 2. The distance referred to here is the number of term movements needed to match the specified phrase.
http://solr.ceudsd.com/solr/flightdelays/select?fl=DEST_CITY&q=DEST_CITY:kolumbas~2&wt=python
Facets
Same as before, but this time return back distinct destination cities as well:
http://solr.ceudsd.com/solr/flightdelays/select?q=DEST_CITY:Boise~3&facet.field=DEST_CITY_str&facet=on&rows=0&wt=python
This previous result sounds like a combined result of the following SQLs:
SELECT * FROM flightdelays WHERE DEST_CITY LIKE 'columbas%' LIMIT 10;
SELECT dest_city, COUNT(*) FROM flightdelays
WHERE DEST_CITY LIKE 'columbas%'
GROUP BY dest_city;
Geo spacial search
Return back records within a circle defined by center point of 39.85,-104.66 [lat,lon] and diameter of 2 kilometer. Display only ORIG_CITY and ORIG_LOCATION_p in the result set and facests for ORIG_CITY_str.
http://solr.ceudsd.com/solr/flightdelays/select?d=2&facet.field=ORIG_CITY_str&facet=on&fl=ORIG_CITY,ORIG_LOCATION_p,&fq={!geofilt}&pt=39.85,-104.66&q=*:*&sfield=ORIG_LOCATION_p&wt=python
SOLR Exercise
HOW MANY FLIGHTS ARRIVED IN SAN FRANCISCO WITH NO DELAY ALTHOUGH THEY DEPARTED AT LEAST 50 MINS BEHIND THE SCHEDULE?
Connect to SOLR with Python
import pysolr
from requests.auth import HTTPBasicAuth
import pprint
solr = pysolr.Solr('http://localhost:8081/solr/flightdelays',auth=HTTPBasicAuth('xxx','xxx'))
Query the first 10 records
results = solr.search('*:*')
for result in results:
pprint.pprint(result)
A more complicated query
List records from the last 10 years where tail number is N520JB:
results = solr.search('TAIL_NUMBER:N838UA AND DATE:[NOW-10YEARS TO *]',fl="DISTANCE,ORIG_CITY,DEST_CITY",sort="DISTANCE desc",rows=10)
for result in results:
pprint.pp(result,width=50)
Neo4j
Links to help you
https://neo4j.com/developer/cypher-query-language/
http://neo4j.com/docs/developer-manual/current/cypher/
https://neo4j.com/blog/analyzing-paradise-papers-neo4j/
Exercise interface: https://sandbox.neo4j.com/
Register and start a sandbox with “Paradise Papers by ICIJ”
Exercise notebook
Simple queries
In Neo4J the SELECT is called MATCH. One of the simplest query is selecting 25 Officer nodes :
MATCH (n:Entity)
RETURN n.name LIMIT 25
In SQL, this would be something like:
SELECT n.name FROM Entity AS n LIMIT 25;
We can use WHERE clause to filter our result:
MATCH (o:Officer)
WHERE o.countries CONTAINS 'Hungary'
RETURN o
In SQL, this would be something like:
SELECT o.countries FROM officer AS o WHERE o.countries LIKE '%Hungary%';
NEO4J Exercise 1
RETURN THE FIRST 10 ADDRESS NODES
NEO4J Exercise 2
HOW MANY PROPERTIES AN ADDRESS NODE HAS?
NEO4J Exercise 3
RETURN THE FIRST 10 COUNTRIES OF THE ADDRESS NODE. WHAT IS THE LAST COUNTRY IN THE LIST?
NEO4J Exercise4
HOW MANY ADDRESS NODES HAS ‘Mexico’ AND ‘Monaco’ IN THEIR ADDRESS PROPERTY?
Joins
Find the Officers and the Entities linked to them (double MATCH)
MATCH (o:Officer)
MATCH p=(o)-[r]-(c:Entity)
RETURN p
LIMIT 50
In SQL, this would be something like:
SELECT *
FROM officer as o
INNER JOIN Entity as c
USING (relationship)
Find the nodes associated “the duchy of lancaster”:
MATCH (o:Officer)
WHERE toLower(o.name) CONTAINS "the duchy of lancaster"
MATCH p=(o)-[r]-(c)
RETURN p
Same, but this time return the nodes 2 hops away:
MATCH (o:Officer)
WHERE toLower(o.name) CONTAINS "the duchy of lancaster"
MATCH p=(o)-[*..2]-(c)
RETURN p
Count
Which country has the most addresses
MATCH (n:Address)
RETURN n.countries, count(*)
ORDER BY count(*) DESC
LIMIT 10
NEO4J Exercise 5
LIST THE NAME AND NUMBER CONNECTIONS OF THE TOP 10 MOST CONNECTED OFFICERS FROM BULGARIA.WHO IS THE NO1?
Connect to neo4j with Python
from neo4j import GraphDatabase, basic_auth
driver = GraphDatabase.driver("neo4j://18.208.109.82:7687", auth=basic_auth("neo4j", "overlays-consoles-regulations"))
Query and display
cypher_query = '''
MATCH (o:Officer)
WHERE o.countries CONTAINS $country
RETURN o.name as name
LIMIT 10
'''
with driver.session(database="neo4j") as session:
results = session.execute_read(lambda tx: tx.run(cypher_query,country="Hungary").data())
for record in results:
print(record['name'])
Close connection
driver.close()