Algorithms
- LDA (Latent Dirichlet Allocation) – unsupervised, used to discover a user-specified number of topics in a text corpus.
- NTM (Neural Topic Model) – unsupervised, used to organize text corpus into topics based on their statistical distribution.
- Object2Vec – an embedding algorithm, learns low-dimensional dense embeddings of high-dimensional objects.
- XGBoost – open-source, supervised, used for regression, classification, and ranking problems.
Measures
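Low variance vs high variance: for input features, higher variance generally carries more information for the model, while a near-constant (low-variance) feature contributes little. This is distinct from model variance, where high variance indicates overfitting.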
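Reading the note above as a statement about feature variance, a minimal sketch of dropping uninformative columns could look like this. The helper name `low_variance_columns`, the threshold, and the toy rows are all hypothetical and chosen only for illustration.

```python
from statistics import pvariance


def low_variance_columns(rows, threshold):
    """Return indices of columns whose population variance is <= threshold."""
    columns = list(zip(*rows))  # transpose: rows -> columns
    return [i for i, col in enumerate(columns) if pvariance(col) <= threshold]


# Hypothetical toy data: column 1 is constant, so it cannot help a model
# distinguish one row from another.
rows = [
    (1.0, 5.0, 10.0),
    (2.0, 5.0, 20.0),
    (3.0, 5.0, 30.0),
    (4.0, 5.0, 40.0),
]

dropped = low_variance_columns(rows, threshold=0.0)
kept = [i for i in range(len(rows[0])) if i not in dropped]
print("dropped:", dropped, "kept:", kept)
```

Variance depends on the scale of each feature, so in practice a threshold like this is usually applied after the features have been brought to comparable units.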
Dimensionality Reduction
- PCA – Principal Component Analysis, linear technique of dimensionality reduction. It maximizes variance of the data in lower dimensional representation.
- NMF – Non-Negative Matrix Factorization, dimensionality reduction, source separation and topic extraction.
- LDA – Linear Discriminant Analysis, finds a linear combination of features that can differentiate two or more classes of objects.
- GDA – Generalized Discriminant Analysis.
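To make the PCA bullet concrete, here is a small sketch restricted to 2-D data, where the direction of maximum variance has a closed form. The function names and the sample points are assumptions made for this example; real workloads would normally use a linear algebra library instead.

```python
import math


def first_principal_component(points):
    """Return (direction, mean) of the top principal axis of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return (math.cos(theta), math.sin(theta)), (mean_x, mean_y)


def project(points, direction, mean):
    """Reduce each 2-D point to one coordinate along the given direction."""
    return [
        (x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
        for x, y in points
    ]


# Hypothetical data lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, mean = first_principal_component(points)
reduced = project(points, direction, mean)
print(direction, reduced)
```

Because these points lie on a single line, the one-dimensional projection keeps all of the variance of the original two-dimensional data.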
Hyperparameter Tuning
- Grid Search – exhaustive searching through a manually specified subset of the hyperparameter space.
- Random Search – replaces the exhaustive enumeration of all combinations by selecting them randomly.
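- Bayesian Optimization – treats tuning as a regression problem: fits a model to the results of previous trials and uses it to choose the next hyperparameter values.
- Hyperband – suitable only for iterative algorithms that publish metrics (e.g. accuracy) after every epoch; it uses those intermediate results to stop underperforming trials early.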
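The difference between the first two strategies in the list can be sketched in a few lines. The `objective` function below is a hypothetical stand-in for a real training run, and the hyperparameter names and ranges are assumptions made for illustration only.

```python
import itertools
import random


def objective(learning_rate, depth):
    """Hypothetical validation loss (lower is better), not a real model."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def grid_search(grid):
    """Evaluate every combination of a manually specified grid."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


def random_search(space, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations from the given ranges."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(*space["learning_rate"]),
            "depth": rng.randint(*space["depth"]),
        }
        score = objective(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best


grid = {"learning_rate": [0.01, 0.1, 0.3], "depth": [3, 6, 9]}
space = {"learning_rate": (0.01, 0.3), "depth": (3, 9)}

grid_best = grid_search(grid)  # 3 x 3 = 9 evaluations
random_best = random_search(space, n_trials=20)
print("grid:", grid_best)
print("random:", random_best)
```

Grid search cost grows multiplicatively with every added hyperparameter, while random search keeps a fixed trial budget and can sample values that fall between grid points.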
AWS Services
- Amazon Kinesis Data Firehose – real-time streaming data; collects, processes and loads data into data lakes, warehouses and analytics services.
- Amazon Kinesis Data Streams – manual scaling (in provisioned mode), can store data, open-ended support for custom producers and consumers.
- Amazon Personalize – fully managed, for personalized data and recommendations; continuous learning to improve performance.
- Amazon Forecast – fully managed, uses historical data for forecasting, resource planning, financial planning.
- Amazon Rekognition – search, verify and organize images and videos.