If you don't know where to start, your best bet is to ask your (or your client's) web developer or DevOps team whether they can grant you access to raw server logs via FTP, ideally without any filtering applied.

Here are general guidelines for finding and managing log data on the three most common servers: Apache log files (Linux), NGINX log files (Linux), and IIS log files (Windows).

We will be using raw Apache files in this project.

Why pandas alone is not enough for log analysis

Pandas (an open-source data manipulation tool built with Python) is pretty ubiquitous in data science.
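To make this concrete for the Apache files used in this project, here is a minimal sketch of turning raw log lines into structured rows before any analysis. It assumes the logs use Apache's Combined Log Format. The sample line, the `COMBINED_LOG_PATTERN` regex, and the `parse_log_line` helper are illustrative names, not part of any library, and real logs may need a looser pattern.

```python
import re
from datetime import datetime

# Assumption: lines follow Apache's Combined Log Format:
# host ident user [time] "method url protocol" status size "referrer" "user agent"
COMBINED_LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # Convert the text fields that are more useful as typed values.
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    row["status"] = int(row["status"])
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


# Hypothetical sample data: one well-formed line and one malformed line.
SAMPLE_LINES = [
    '66.249.66.1 - - [10/Oct/2020:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    "not a valid log line",
]

# Keep only the lines that parsed successfully.
rows = [row for row in map(parse_log_line, SAMPLE_LINES) if row is not None]
```

If pandas is installed, the resulting list of dicts can be passed straight to `pandas.DataFrame(rows)`, which is fine for a log file that fits in memory.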
Slicing and dicing tabular data structures is a must, and pandas works like a charm when the data fits in memory! In other words, a few gigabytes, but not terabytes.

Parallel computing aside, a database is usually a better solution for big data tasks that don't fit in memory. With a database, we can work with datasets that consume terabytes of disk space. Everything can be queried (via SQL), accessed, and updated in no time!

In this article, we will query our raw log data programmatically in
Python via Google BigQuery. It's easy to use, affordable, and blazing fast, even on terabytes of data! The Python/BigQuery combo also lets you query files stored on Google Cloud Storage. Sweet!

If Google is a no-no for you and you want to try alternatives, Amazon and Microsoft also offer cloud data warehouses. They also integrate well with Python:

Amazon: AWS S3, Redshift
Microsoft: Azure Storage, Azure Data Warehouse, Azure Synapse

Create a GCP account and configure Cloud Storage