API - Database¶
This is the alpha version of the database management system. If you have trouble, you can ask for help at fangde.liu@imperial.ac.uk .
Note
We are still writing up the documentation, please be patient.
Why TensorDB¶
TensorLayer is designed for production and aims to be applied to large-scale machine learning applications. TensorDB introduces database infrastructure to address the many challenges in large-scale machine learning projects, such as:
- How to manage the training data and load the training datasets
- What to do when the dataset is so large that it exceeds the storage of one computer
- How to manage different models and versions, and compare different models
- How to automate the whole training, evaluation and deployment of machine learning models
In the TensorLayer system, we introduce database technology to address the issues above.
TensorDB is designed following three principles.
Everything is Data¶
TensorDB is a data warehouse that captures the whole machine learning development process. The data inside TensorDB can be categorized as:
- Data and Labels: All the data for training, validation and prediction. The labels can be manually labelled or generated by machine.
- Model Architecture: This group stores the different model architectures, which users can select to use.
- Model Parameters: This table stores the model parameters of each epoch of the training.
- Jobs: All the computation is cut into several jobs, each containing some computing workload. For training, a job includes the training data, the model parameters, the model architecture and how many epochs to train. Validation jobs and inference jobs are similar.
- Logs: The logs store the step time, accuracy and other metrics of each training step, together with the timestamps.
TensorDB is in principle a keyword-based search engine. Each model, parameter set or training data item is assigned many tags, and the data are stored in two layers. On top is the index layer, which stores the blob storage references together with all the tags assigned to the data; it is implemented on a NoSQL document database such as MongoDB. The second layer stores big chunks of data, such as videos, medical images or image masks, and is usually implemented as a file system. Our open source implementation is based on MongoDB: the blob data are stored in GridFS, while the tag index is stored in documents.
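As a rough in-memory sketch of this two-layer design (illustrative only, not the actual TensorDB internals), the index layer maps tags to blob references while the blob layer holds the raw bytes:

```python
import uuid

# Blob layer: large binary chunks keyed by an opaque id (GridFS in practice).
blob_store = {}

# Index layer: one tag document per item, holding the blob reference
# (a MongoDB collection in practice).
index = []

def save(data: bytes, tags: dict) -> str:
    """Store the blob and index it under its tags; return the blob id."""
    blob_id = uuid.uuid4().hex
    blob_store[blob_id] = data
    index.append({'blob_id': blob_id, **tags})
    return blob_id

def find(query: dict) -> list:
    """Return the blobs whose tag documents match every key/value in query."""
    return [blob_store[doc['blob_id']]
            for doc in index
            if all(doc.get(k) == v for k, v in query.items())]

save(b'weights-v1', {'model': 'mlp', 'epoch': 1})
save(b'weights-v2', {'model': 'mlp', 'epoch': 2})
print(find({'model': 'mlp', 'epoch': 2}))  # [b'weights-v2']
```

In the real implementation, `save` writes to GridFS and `find` issues a MongoDB query against the tag documents.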
Everything is identified by Query¶
Within the TensorDB framework, any entity in the data warehouse, such as data, models or jobs, is specified by a database query. The first advantage is that a query is space-efficient and can specify multiple objects concisely. The second advantage is that such a design enables a highly flexible software system: data, model architectures and training are interchangeable, and much work can be done by simply rewiring different components. This lets us develop many new applications just by changing the query, without changing any application code.
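To illustrate (with hypothetical tags, not the actual TensorDB schema), a single MongoDB-style filter can concisely select many objects at once; a minimal matcher makes the idea concrete:

```python
def matches(doc: dict, query: dict) -> bool:
    """Minimal MongoDB-style matcher: supports equality and {'$gt': x}."""
    for key, cond in query.items():
        if isinstance(cond, dict) and '$gt' in cond:
            if not (key in doc and doc[key] > cond['$gt']):
                return False
        elif doc.get(key) != cond:
            return False
    return True

params = [
    {'model': 'mlp', 'epoch': 1, 'acc': 0.91},
    {'model': 'mlp', 'epoch': 2, 'acc': 0.95},
    {'model': 'cnn', 'epoch': 1, 'acc': 0.97},
]

# One concise query selects every mlp checkpoint above 92% accuracy.
good = [p for p in params if matches(p, {'model': 'mlp', 'acc': {'$gt': 0.92}})]
print(good)  # [{'model': 'mlp', 'epoch': 2, 'acc': 0.95}]
```

Swapping the query swaps the selected entities; the surrounding training code does not change.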
A pulling-based stream processing pipeline¶
With a large dataset, we can assume that the data is unlimited. TensorDB provides a streaming interface, implemented in Python as generators, which keeps returning new data during training. The training system has no notion of epochs; instead, it knows the batch size and after how many steps to store the parameters.
Several techniques are used behind the streaming interface. The stream is implemented on top of database cursors, so each search returns only a cursor, not the actual data; the data is loaded only when the generator is evaluated. The data loading is further optimised:
- Data are compressed before storage and decompressed on loading.
- Data are loaded in bulk mode to optimise the IO traffic.
- Augmentation and random sampling are computed on the fly after the data are loaded onto the local computer.
- To optimise space, there will also be a cache system that stores only the most recent blob data.
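The points above can be sketched with a toy cursor and generator (in-memory stand-ins; `zlib` stands in for the actual compression codec):

```python
import zlib

def find_cursor(collection, query):
    """Lazy 'cursor': yields matching documents one at a time, no data copied up front."""
    return (doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items()))

def data_generator(collection, query, bulk_size=2):
    """Stream decompressed records in bulk; nothing is loaded until iteration starts."""
    bulk = []
    for doc in find_cursor(collection, query):
        bulk.append(zlib.decompress(doc['blob']))  # decompress on the fly
        if len(bulk) == bulk_size:                 # bulk loading cuts IO round-trips
            yield from bulk
            bulk = []
    yield from bulk

collection = [{'type': 'train', 'blob': zlib.compress(b'sample-%d' % i)}
              for i in range(5)]
g = data_generator(collection, {'type': 'train'})
print(next(g))  # b'sample-0'
```

The generator can be passed straight to a training loop, which simply pulls batches for as long as it wants.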
Based on the streaming interface, TensorLayer can implement continuous machine learning. On a distributed system, model training, validation and deployment can run continuously on different computers: the trainer keeps optimising the models, the evaluator keeps evaluating the most recently added models, and the deployment system keeps pulling the best models from the TensorDB warehouse.
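The deployment side of such a loop can be sketched as a worker that repeatedly pulls the best checkpoint (an in-memory stand-in for the parameter collection; field names are illustrative):

```python
# Each entry is a checkpoint tagged with its validation accuracy.
checkpoints = [
    {'params': 'w1', 'acc': 0.90},
    {'params': 'w2', 'acc': 0.95},
    {'params': 'w3', 'acc': 0.93},
]

def pull_best(checkpoints):
    """What a deployment worker would poll for: the highest-accuracy checkpoint."""
    return max(checkpoints, key=lambda c: c['acc'])

best = pull_best(checkpoints)
print(best['params'])  # w2
```

A real deployment worker would run this query against the parameter collection on a timer, picking up newly added models automatically.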
Preparation¶
In principle, TensorDB can be implemented on any document NoSQL database system. The existing implementation is based on MongoDB; further implementations on other databases will be released depending on progress. It would be straightforward to port the TensorDB system to Google Cloud, AWS and Azure.
The following tutorials are based on the MongoDB implementation.
Install MongoDB¶
The installation instructions for MongoDB can be found in the MongoDB Docs. There are also managed MongoDB services from Amazon or GCP, as well as MongoDB Atlas from MongoDB.
Users can also use Docker, which is a powerful tool for deploying software.
After installing MongoDB, a MongoDB management tool with a graphical user interface will be extremely valuable.
Users can install Studio 3T (formerly MongoChef), a user interface which is free for non-commercial use. studio3t
Start MongoDB service¶
After MongoDB is installed, you should start the database:
mongod
You can specify the path of the database files with the --dbpath
flag
Quick Start¶
A fully working example with the MNIST training set is in _TensorLabDemo.ipynb_
Connect to database¶
To use the TensorDB MongoDB implementation, you need the pymongo client.
You can install it by
pip install pymongo
pip install lz4
It is very straightforward to connect to the TensorDB system. You can try the following code:
from tensorlayer.db import TensorDB
db = TensorDB(ip='127.0.0.1', port=27017, db_name='your_db', user_name=None, password=None, studyID='ministMLP')
The ip
is the IP address of the database, and port
is the port number of MongoDB.
You may need to specify the database name and study ID.
The study ID is a unique identifier for an experiment.
TensorDB stores different studies in one data warehouse. This has pros and cons; the benefit is that if each study tries a different model architecture, it is very easy for us to compare the different architectures.
Log and parameters¶
The basic application is to use TensorDB to save the model parameters and the training/evaluation/testing logs. This can easily be done by replacing the print function with the db log functions.
For saving the training log and parameters, we have the
db.train_log
and
db.save_parameter
methods.
Suppose we save the log at each step and save the parameters at each epoch; the code looks like this:
for epoch in range(0, epoch_count):
    _, ac = sess.run([train_op, loss], feed_dict={x: X_train, y_: y_train})
    db.train_log({'accuracy': ac})
    db.save_parameter(sess.run(network.all_params), {'acc': ac})
The code for saving the validation and test logs is similar.
Model Architecture and Jobs¶
TensorDB also supports model architectures and a job system. In the current version, both the model architecture and the job are simply strings; it is up to the user to specify how to convert the string back into a model or a job. For example, in many of our cases, we simply store Python code:
code = '''
print("hello")
'''
db.save_model_architecture(code, {'name': 'print'})
c, fid = db.find_model_architecture({'name': 'print'})
exec(c)

db.push_job(code, {'type': 'train'})

# worker
code = db.pop_job()
exec(code)
Database Interface¶
The training set is managed by a separate database, and each application has its own database. However, every database interface should support two methods: 1. find_data, 2. data_generator
An example for the MNIST dataset is included in the TensorLabDemo code.
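A minimal sketch of such a dataset interface (the storage is a plain Python list here; a real implementation would query MongoDB):

```python
class MnistDataset:
    """Toy dataset exposing the two required methods: find_data and data_generator."""

    def __init__(self, records):
        # each record: {'x': features, 'y': label, plus arbitrary tags such as 'type'}
        self.records = records

    def find_data(self, query):
        """Return all records whose tags match the query."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in query.items())]

    def data_generator(self, query):
        """Stream (x, y) pairs matching the query, one at a time."""
        for r in self.find_data(query):
            yield r['x'], r['y']

ds = MnistDataset([{'x': [0.1], 'y': 0, 'type': 'train'},
                   {'x': [0.9], 'y': 1, 'type': 'test'}])
print(list(ds.data_generator({'type': 'train'})))  # [([0.1], 0)]
```

Any storage backend that honours these two methods can be dropped into the rest of the framework unchanged.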
Data Importing¶
With a database, the development workflow is very flexible. As long as the content in the database stays the same, users can use whatever tools they like to write into the database.
The TensorLabDemo has a data import interface, which allows the user to inject data later.
Users can import data with the following code:
db.import_data(X,y,{'type':'train'})
Application Framework¶
In fact, in real applications, we rarely code everything from scratch using the TensorDB interface directly. As demonstrated in the TensorLabDemo,
we implemented four classes, each with a well-defined interface: 1. The Dataset. 2. The TensorDB. 3. The Model; a model is logically a full component that can be trained, evaluated and deployed, and it has properties such as parameters. 4. The DBLogger, which is the connector from the model to TensorDB, implemented as callback functions that are automatically called at each batch step and each epoch.
Users can build on the TensorLabDemo code and override the interfaces to suit their own applications' needs.
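A sketch of the DBLogger idea (the method and class names here are illustrative, not the exact TensorLabDemo interface): callbacks that forward metrics to the database at each batch and checkpoint parameters at each epoch:

```python
class DBLogger:
    """Connects a model to the database: called back at each batch and each epoch."""

    def __init__(self, db, model):
        self.db = db
        self.model = model

    def on_batch_end(self, metrics):
        self.db.train_log(metrics)               # log step metrics

    def on_epoch_end(self, metrics):
        params = self.model.get_params()
        self.db.save_parameter(params, metrics)  # checkpoint parameters per epoch

# Tiny in-memory stand-ins so the sketch runs without a real database or model.
class FakeDB:
    def __init__(self): self.logs, self.saved = [], []
    def train_log(self, m): self.logs.append(m)
    def save_parameter(self, p, m): self.saved.append((p, m))

class FakeModel:
    def get_params(self): return {'w': 0.5}

db, model = FakeDB(), FakeModel()
logger = DBLogger(db, model)
logger.on_batch_end({'accuracy': 0.8})
logger.on_epoch_end({'acc': 0.8})
print(len(db.logs), len(db.saved))  # 1 1
```

Because the logger is invoked purely through callbacks, the training loop stays free of any database code.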
When training, the overall flow is: first, find a data generator from the dataset module
g = dataset.data_generator({"type": [your type]})
then initialize a model with a name
m=model('mytes')
During training, connect the db logger and TensorDB together:
m.fit_generator(g,dblogger(tensordb,m),1000,100)
If the work is distributed, we have to save the model architecture, then reload and execute it:
db.save_model_architecture(code, {'name': 'mlp'})
db.push_job({'name': 'mlp'}, {'type': XXXX}, {'batch': 1000, 'epoch': 100})
The worker then runs the job with code like the following:
j = job.pop()
g = dataset.data_generator(j.filter)
c = tensordb.load_model_architecture(j.march)
exec(c)
m = model()
m.fit_generator(g, dblogger(tensordb, m), j.batch_size, j.epoch)
Experimental Database Management System.
Latest Version
class tensorlayer.db.TensorDB(ip='localhost', port=27017, db_name='db_name', user_name=None, password='password', studyID=None)[source]¶
TensorDB is a MongoDB based manager that helps you to manage data, network topology, parameters and logging.
Parameters:
- ip : string, localhost or IP address.
- port : int, port number.
- db_name : string, database name.
- user_name : string, set to None if it does not need authentication.
- password : string.
Methods
- save_params([params, args]) : Save parameters into MongoDB Buckets, and save the file ID into Params Collections.
- del_job, del_params, del_test_log, del_train_log, del_valid_log
- find_all_params, find_one_job, find_one_params
- load_model_architecture, save_model_architecture
- peek_job, push_job, run_job, save_job
- test_log, train_log, valid_log