Environment Setup - Vertex AI for ML Operations [notebook 00]

StatMike
4 Jan 2022 · 24:21

TLDR

In this video, Mike guides viewers through setting up a project for machine learning workflows using Vertex AI on Google Cloud. He covers creating a project, enabling necessary APIs, and setting up a notebook instance to clone and run Jupyter notebooks from a GitHub repository. The video also touches on managing costs by deleting projects and individual resources.

Takeaways

  • 🌟 **Introduction to Series**: The video is part of a series focused on machine learning workflows, covering data preparation, model training, evaluation, and deployment.
  • 🚀 **Project Setup**: The series begins with setting up a Google Cloud project, which will contain all resources and is easily deletable to stop billing.
  • 📚 **Notebook Review Option**: Viewers can review the provided Jupyter notebooks directly on GitHub or clone them into their Google Cloud environment to run.
  • 💻 **Google Cloud Console**: A new browser tab is used to access the Google Cloud console for project creation and resource setup.
  • 🔗 **Repository Link**: Links are provided to streamline the setup process, such as creating a new project and enabling necessary APIs.
  • 📈 **Vertex AI & Workbench**: Vertex AI is enabled for the project, and Workbench is visited to host notebook instances, which are used to run the machine learning workflows.
  • 🔧 **Notebook Instance Creation**: A notebook instance is created without GPUs, using TensorFlow 2.3, chosen for its compatibility with the series' modeling techniques.
  • 📁 **Repository Cloning**: The GitHub repository is cloned into the newly created notebook instance to access and run the Jupyter notebooks.
  • 🔍 **BigQuery and Cloud Storage**: The script demonstrates how to use BigQuery to extract data into Google Cloud Storage, where it's saved as a CSV file.
  • 🛠️ **Package Management**: The video covers the installation of necessary Python packages for the workflow, including TensorFlow, Kubeflow Pipelines (`kfp`), and Plotly.
  • 💡 **Cost Management**: Advice is given on how to manage costs, including deleting the project to eliminate all associated resources and billing.
  • ❓ **Q&A and Feedback**: The video concludes with an invitation for viewers to ask questions, provide feedback, and engage with the community for continuous improvement.

Q & A

  • What is the main purpose of the video?

    -The main purpose of the video is to guide viewers through setting up their project environment for a series of machine learning workflows using Jupyter notebooks, specifically focusing on Google Cloud's Vertex AI.

  • What is the first step in setting up the project environment?

    -The first step is to create a new project in the Google Cloud Console, which will act as a container for all the resources that will be set up during the machine learning workflows.

  • How does the speaker suggest minimizing costs when using Google Cloud resources?

    -The speaker suggests two ways to minimize costs: 1) Using the free credit provided to new Google Cloud users for experimentation, and 2) Deleting the entire project, which will remove all created resources and stop any further billing.

  • What is the role of Vertex AI in the machine learning workflow presented in the video?

    -Vertex AI is used for enabling various machine learning services such as hosting notebook instances and managing model deployments. It is a central component in the machine learning operations journey.

  • How does the speaker plan to use BigQuery in the workflow?

    -The speaker plans to use BigQuery to extract data from a public dataset and save it as a CSV file in Google Cloud Storage, which will then be used for further analysis in the Jupyter notebooks.
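
A minimal sketch of that extraction step, assuming hypothetical project and bucket names; the public table path (`bigquery-public-data.ml_datasets.ulb_fraud_detection`) is the one the series draws on:

```python
from google.cloud import bigquery

PROJECT_ID = 'your-project-id'  # hypothetical; substitute your own project
BUCKET = 'your-bucket-name'     # hypothetical; the bucket created for the series
SOURCE_TABLE = 'bigquery-public-data.ml_datasets.ulb_fraud_detection'

client = bigquery.Client(project=PROJECT_ID)

# Export the table to a CSV object in Cloud Storage; extract jobs run
# asynchronously, so block on result() before using the file.
extract_job = client.extract_table(
    SOURCE_TABLE,
    f'gs://{BUCKET}/fraud/fraud.csv',
    location='US',  # the public dataset lives in the US multi-region
)
extract_job.result()
```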

  • What version of TensorFlow is used in the notebook instance?

    -The speaker chooses to use TensorFlow version 2.3 in the notebook instance, as it is suitable for the modeling techniques that will be demonstrated in the series.

  • How does the speaker handle the cloning of the GitHub repository into the notebook instance?

    -The speaker uses the HTTPS link from the GitHub repository to clone it into the notebook instance. This is done through the JupyterLab interface via the Git 'Clone a Repository' option, pasting the copied link.

  • What is the significance of creating a storage bucket in Google Cloud Storage?

    -Creating a storage bucket in Google Cloud Storage is significant as it provides a place to store files and datasets. In the context of the video, it is used to store the CSV file extracted from BigQuery.
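
Bucket creation can also be scripted from the notebook rather than done in the console; a minimal sketch, assuming a placeholder project ID and region (bucket names must be globally unique):

```python
from google.cloud import storage

PROJECT_ID = 'your-project-id'  # hypothetical
REGION = 'us-central1'          # assumed; the series works in a US Central region

client = storage.Client(project=PROJECT_ID)

# Reusing the project ID as the bucket name is a common convention,
# since project IDs are already globally unique.
bucket = client.create_bucket(PROJECT_ID, location=REGION)
print(f'Created gs://{bucket.name}')
```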

  • What are the additional packages installed for the machine learning workflow?

    -The additional packages installed include `kfp` (Kubeflow Pipelines) and the Google Cloud pipeline components for orchestration, `plotly` for interactive graphing, and an update to the `google-cloud-aiplatform` client library for interacting with Vertex AI.
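
A notebook-cell sketch of those installs, assuming the garbled package names in the transcript refer to `kfp` and `google-cloud-pipeline-components` (exact versions in the video may differ):

```python
# Run inside the notebook instance; --user keeps installs in the home directory.
!pip install --user --upgrade kfp google-cloud-pipeline-components
!pip install --user plotly
!pip install --user --upgrade google-cloud-aiplatform
```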

  • How does the speaker ensure that the setup is cost-effective?

    -The speaker ensures cost-effectiveness by choosing a small compute instance without GPUs, using a public dataset, and providing instructions on how to delete the project to stop incurring costs after the experiment.

  • What is the final step suggested by the speaker for users to engage with the content?

    -The final step suggested by the speaker is to provide feedback, ask questions, or suggest improvements through the GitHub repository where the notebooks are hosted.

Outlines

00:00

🚀 Project Setup and Workflow Overview

The paragraph introduces Mike, a statistician and Googler, who is passionate about learning and sharing. He welcomes viewers to his office and sets the stage for the series by explaining that it will cover end-to-end machine learning workflows encapsulated in Jupyter notebooks. The workflows include data acquisition, preparation, model training, evaluation, and deployment, with a possibility of automating the entire process. Mike guides viewers on how to set up the project, either by reviewing the files on GitHub or by creating a Google Cloud project, setting up resources, and working within a Google Cloud environment. He emphasizes the ease of deletion and cost-effectiveness of the project, especially for new users who receive a credit upon signing up.

05:02

📚 Creating a Notebook Instance and Cloning the Repository

In this section, Mike delves into the specifics of creating a notebook instance in Google Cloud's Workbench and the importance of selecting the right environment for the project. He chooses TensorFlow 2.3 without GPU support, considering the simplicity and speed of the modeling techniques used in the series. Mike provides a step-by-step guide on creating a new notebook, selecting the appropriate settings, and cloning the GitHub repository containing the notebooks. He also discusses the ease of reviewing the notebooks on GitHub and the benefits of running them in a Google Cloud environment, including the convenience of having everything set up and authenticated for interaction with Google Cloud services.

10:03

🛠️ Setting Up the Data and Environment

Mike continues the setup process by explaining how to create a storage bucket in Google Cloud Storage for storing data and the importance of defining the project and region for the upcoming tasks. He details the process of extracting data from a BigQuery public dataset and saving it as a CSV file in Cloud Storage. Mike also discusses the use of different libraries and packages, such as pandas and sklearn, for future interactions and data processing. The paragraph concludes with Mike demonstrating the successful creation of the storage bucket and the extraction of the CSV file, setting the stage for further data manipulation and analysis in subsequent videos.

15:05

🔄 Updating Modules and Cleaning Up Resources

This paragraph focuses on the final steps of the setup process, including updating key modules for interaction with Google Cloud's Vertex AI and installing additional packages for data orchestration and visualization. Mike highlights the importance of keeping the environment up-to-date and compatible with the tools used in the series. He also addresses the concern of cost associated with using Google Cloud resources, providing clear instructions on how to avoid charges by either deleting the entire project or removing individual resources. Mike emphasizes the ease of cleanup and encourages viewers to provide feedback and contribute to the GitHub repository for continuous improvement and collaboration.

20:08

🤝 Encouraging Collaboration and Feedback

In the concluding paragraph, Mike invites viewer engagement by encouraging them to like, subscribe, and leave comments if they found the content helpful. He reiterates the importance of collaboration and feedback in improving the practice of AI and machine learning. Mike extends an invitation for viewers to contribute their ideas, corrections, or improvements to the GitHub repository, fostering a more connected and inclusive community. He ends the video on a positive note, expressing gratitude for the viewers' time and participation, and looks forward to working together to enhance the understanding and accessibility of AI and ML for a broader audience.

Keywords

💡Environment Setup

The process of configuring the necessary software, tools, and services required for a project. In the context of the video, environment setup refers to the initial preparation for machine learning operations using Vertex AI, which includes creating a Google Cloud project, enabling APIs, and setting up a notebook instance.

💡Jupyter Notebooks

Jupyter Notebooks are interactive computing environments that allow users to create and share documents containing live code, equations, visualizations, and narrative text. They are widely used in data science and machine learning for prototyping, analysis, and demonstration of algorithms. In the video, Jupyter notebooks encapsulate end-to-end machine learning workflows, providing a platform to execute and iterate on the code.

💡Machine Learning Workflows

Machine learning workflows refer to the series of steps taken to build, train, and deploy machine learning models. These workflows typically include data collection, preprocessing, model selection, training, evaluation, and deployment. In the video, the author discusses setting up an environment to execute such workflows using Vertex AI for ML operations.

💡Google Cloud Platform (GCP)

Google Cloud Platform is a suite of cloud computing services offered by Google, which includes various products for computing, data storage, and networking. In the video, GCP is used as the infrastructure to host and run the machine learning environment, enabling the creation of projects, storage buckets, and notebook instances.

💡Vertex AI

Vertex AI is a Google Cloud service that provides end-to-end machine learning platform capabilities for building, deploying, and managing machine learning models. It simplifies the machine learning lifecycle by automating tasks such as model training and deployment. In the video, Vertex AI is a central component of the machine learning operations environment setup.

💡Notebook Instance

A notebook instance is a virtual machine specifically designed to run Jupyter notebooks. It comes pre-installed with various software and libraries needed for data analysis and machine learning. In the video, the creation of a notebook instance is a key step to access an environment where the audience can execute the provided Jupyter notebooks.
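
For readers who prefer the command line over the console walkthrough in the video, a hedged equivalent of the instance creation; the instance name, machine type, and image family below are assumptions:

```python
# Hypothetical gcloud equivalent of the console steps shown in the video.
!gcloud notebooks instances create my-notebook --vm-image-project=deeplearning-platform-release --vm-image-family=tf2-2-3-cpu --machine-type=n1-standard-4 --location=us-central1-a
```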

💡Cloud Storage

Cloud Storage is a service provided by Google Cloud Platform for storing and accessing data in the cloud. It serves as an object storage system that allows for the storage and retrieval of arbitrary data blobs and metadata. In the video, cloud storage is used to store datasets and other files required for the machine learning workflows.

💡BigQuery

BigQuery is a fully managed, serverless data warehouse solution provided by Google Cloud Platform. It enables scalable analysis over petabytes of data with high performance and low latency. In the video, BigQuery is used as a source to extract a public dataset for the machine learning workflows.
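
For illustration, a small query against the public fraud-detection table the series uses, pulled into a pandas DataFrame; the project ID is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client(project='your-project-id')  # hypothetical

query = """
SELECT *
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
LIMIT 5
"""
# to_dataframe() requires pandas; newer client versions also use db-dtypes.
df = client.query(query).to_dataframe()
print(df.shape)
```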

💡TensorFlow

TensorFlow is an open-source software library for machine learning, developed by the Google Brain team. It provides a comprehensive ecosystem of tools, libraries, and community resources that lets researchers and developers build and deploy machine learning applications. In the video, TensorFlow 2.3 is selected as the machine learning framework for running the notebooks.

💡APIs

APIs, or Application Programming Interfaces, are protocols and tools for building software applications. They allow different software systems to communicate with each other, enabling the integration of functionality and data. In the video, APIs refer to the services provided by Google Cloud Platform that need to be enabled for the project to function correctly.

Highlights

Mike, a statistician and Googler, introduces the series on machine learning workflows using Vertex AI for ML Operations.

The project involves end-to-end machine learning workflows encapsulated in Jupyter notebooks.

The series covers data preparation, model training, evaluation, deployment, and automation.

Google Cloud environment is used to run the notebooks, with a new project created for the purpose.

A new user-managed notebook instance is set up with TensorFlow 2.3 for the series.

The project is named 'statmike-demo-3' and is located in the US Central region.

Google Cloud Storage bucket is created for storing data and model artifacts.

A public dataset from BigQuery is used for demonstration: `ulb_fraud_detection`.

Data extraction from BigQuery to the storage bucket is orchestrated using Jupyter notebooks.

The `google-cloud-aiplatform` client library is updated so the notebooks can interact with the latest Vertex AI modules and components.

Cost management is discussed, with options to delete the project or individual resources to avoid charges.
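
A sketch of the project-level cleanup described here, runnable from a notebook cell or Cloud Shell; the project ID is a placeholder:

```python
# Deleting the project removes every resource inside it and stops billing.
PROJECT_ID = 'your-project-id'  # hypothetical
!gcloud projects delete {PROJECT_ID} --quiet
```

Individual resources (the notebook instance, the storage bucket, BigQuery datasets) can instead be deleted one by one from the console.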

The series aims to make AI and ML practices more collaborative, accurate, and approachable.

GitHub repository is mentioned for further resources, corrections, and improvements.

The video provides a step-by-step guide for setting up the environment for ML workflows.

The use of JupyterLab is highlighted for running and managing notebooks.

Pipeline components for Google Cloud are installed and updated for orchestration.

Plotly is installed for interactive graphing in future notebooks.

The setup process is designed to be cost-effective and easy to clean up after experimentation.