DataFrames can be partially cached, but partitions cannot be partially cached. When you use cache() or persist(), a DataFrame is not fully cached until you invoke an action that goes through every record, e.g. count(). So if an action like take(1) is used, only one partition will be cached, because Catalyst understands that you don't need to compute all partitions just to get one record. #spark
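A minimal PySpark sketch of this behavior (the session setup and numbers are mine, just for illustration); the Storage tab of the Spark UI shows how much of the cache actually got materialized:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-cache-demo").getOrCreate()
df = spark.range(0, 10_000_000, numPartitions=16)

df.cache()   # lazy: nothing is materialized yet
df.take(1)   # only the partition needed for this single row gets cached
             # Spark UI -> Storage: "Fraction Cached" shows ~1/16, not 100%

df.count()   # touches every record, so now all 16 partitions are cached
             # Spark UI -> Storage: "Fraction Cached" shows 100%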
Let robots yell at people
As a technical lead, you are at your best as a productivity multiplier for your entire team. Anything you can do to increase your team's velocity has compounding effects for the company. Think automation/instrumentation/documentation. (rephrasing some twitter guy, not my words)
For example, I often debug Python code with plain built-in print statements. Yes, I rarely use pdb and all that stuff, and I don't suffer much. But sometimes I forget to remove those debugging prints before opening a PR. Instead of having colleagues come back to me after every review to point it out, we can add a linter rule that catches it automatically (a minimal sketch below). Nobody's relationship gets strained, repeated mistakes are prevented at the source, and newcomers get trained by the tooling.
What do you think?
#dev #management
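A minimal sketch of such a rule as a standalone AST checker that a pre-commit hook could run on changed files (the script layout and message are mine for illustration; as far as I know, a ready-made flake8-print plugin exists for the same job):

import ast
import sys

def find_prints(path):
    """Yield line numbers of print() calls in a Python file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            yield node.lineno

if __name__ == "__main__":
    failed = False
    for path in sys.argv[1:]:  # pre-commit passes the changed files as arguments
        for lineno in find_prints(path):
            sys.stderr.write(f"{path}:{lineno}: leftover debugging print()\n")
            failed = True
    sys.exit(1 if failed else 0)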
How are you debugging in Python?
Anonymous Poll
6% - pdb
29% - print()
48% - PyCharm/Spyder/etc debugger
6% - breakpoint()
6% - coding in REPL/Jupyter/IPython
3% - other
I expect that for you, Airflow is more than just a fancy word for a fart. Airflow DAGs often become too big and complicated to understand. They get split between different teams within a company for implementation and support, and you end up with the problem of wiring the different DAGs back into one pipeline.
https://luminousmen.com/post/airflow-dag-dependencies
#big_data
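One common way to wire such DAGs together (the DAG ids, task ids and schedule below are placeholders of mine; imports are for Airflow 2.x) is TriggerDagRunOperator, which lets an upstream DAG kick off a downstream one; ExternalTaskSensor is the usual alternative when the downstream DAG should wait instead:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="team_a_ingestion",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = DummyOperator(task_id="ingest")  # stands in for the real ingestion work

    trigger_downstream = TriggerDagRunOperator(
        task_id="trigger_team_b_processing",
        trigger_dag_id="team_b_processing",   # the other team's DAG
    )

    ingest >> trigger_downstream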
In Wired, a cool article about how ML combined with good physical simulation leads to order-of-magnitude faster experiments. Here it drives the development of new batteries. My respect.
https://www.wired.com/story/ai-is-throwing-battery-development-into-overdrive/
#stuff
New York University (NYU) has made available a course on deep learning (https://cds.nyu.edu/deep-learning/) by Yann LeCun.
He is one of the creators of modern deep learning, a Turing Award winner, and the founding head of AI research at Facebook.
You don't need Hadoop
In the article the author argues that there is no need to use Hadoop for 600MB of data. He puts a clear emphasis on data size, and his point is understandable, but in my opinion data size matters less than data structure and query patterns. Yes, Big Data is when the data doesn't fit on one machine. But not every task that requires horizontal scaling is big data.
If the processing is trivial, even 100 terabytes may be "small" for Hadoop. It depends on what you do and how you do it.
Everything boils down to a trivial conclusion: use a tool suitable for the task.
#big_data
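To make the 600MB point concrete, a tiny sketch (the file name and column are placeholders of mine): a dataset of that size fits in memory on an ordinary laptop and is a one-liner in pandas, no cluster required:

import pandas as pd

# a hypothetical ~600MB CSV: loads into memory on a laptop in seconds
df = pd.read_csv("events.csv")

# the kind of aggregation people sometimes reach for Hadoop to do
daily_counts = df.groupby("event_date").size()
print(daily_counts.head())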
Apple released a TensorFlow fork with support for the new M1 chip on macOS Big Sur. So TensorFlow now has a build for Apple's own processor, with hardware acceleration. Users can now get up to 7x faster training on the new 13-inch MacBook Pro with M1.
I like the idea of moving ML closer to the user; that makes a lot of sense, especially for Apple.
Link
#ml
Evolution of trust
COVID is still brutal on the streets, but there is an epidemic that has been raging far longer. Using game theory, the author tries to explain the epidemic of mistrust and to find a way to fix it.
https://ncase.me/trust/
#soft_skills
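The core mechanic behind the guide is the repeated prisoner's dilemma; here is a tiny sketch of it (the payoff numbers and strategies are my own simplification, not taken from the site):

# True = cooperate, False = cheat; payoff is from the first player's point of view
PAYOFF = {(True, True): 2, (True, False): -1, (False, True): 3, (False, False): 0}

def tit_for_tat(history):
    return True if not history else history[-1][1]  # copy the opponent's last move

def always_cheat(history):
    return False

def play(strategy_a, strategy_b, rounds=10):
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_a), strategy_b(history_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

print(play(tit_for_tat, always_cheat))  # one-off betrayal pays, then both stall: (-1, 3)
print(play(tit_for_tat, tit_for_tat))   # repeated cooperation compounds: (20, 20)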
Today we're gonna deep dive into Spark memory management.
https://luminousmen.com/post/dive-into-spark-memory
If you have been writing Spark applications for a while, you have surely come across tuning its configuration parameters. Now you can do it automatically with a tool that optimizes the cluster resources for you. Made by me, for you.
http://spark-configuration.luminousmen.com/
#spark #big_data
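For context, these are the kinds of knobs the tool computes for you; a minimal sketch of setting them by hand in PySpark (the values are arbitrary examples, not recommendations):

from pyspark.sql import SparkSession

# the usual suspects when sizing a Spark job manually (example values only)
spark = (
    SparkSession.builder
    .appName("manually-tuned-job")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "14g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)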
Who here is professionally connected to, or just interested in, DS/ML?
Anonymous Poll
27% - I'm an ML engineer
35% - I'm a Data Scientist
38% - Never did any, but interested
AWS S3 is Now Strongly Consistent!
Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent.
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
#aws
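What that means in practice, as a small boto3 sketch (bucket and key names are placeholders): a read issued right after a successful write now returns the new data, and LIST reflects it too, with no eventual-consistency window:

import boto3

s3 = boto3.client("s3")

# write, then immediately read back: the GET now always reflects the PUT
s3.put_object(Bucket="my-bucket", Key="reports/latest.json", Body=b'{"ok": true}')
body = s3.get_object(Bucket="my-bucket", Key="reports/latest.json")["Body"].read()

# LIST is strongly consistent too: the new key shows up right away
listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="reports/")
keys = [obj["Key"] for obj in listing.get("Contents", [])]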
A preview version of Learning Spark, updated for the latest 3.0 release. I'm saving you the time of filling out the forms; enjoy the book.
#big_data #spark
When I got my first computer, it had Windows installed on it. I don't know about you, but every time Windows threw an error that meant nothing to me, I went to the lower Internet to figure it out. And I found out that there are many more interesting things on the lower Internet...
Windows has created a whole layer of programmers who can solve problems. Thanks to it for that.
But Windows is still a pile of horse shit.
That's why you can buy a New Year's ugly sweater with it: https://gear.xbox.com/pages/windows
Andreessen Horowitz published a detailed guide on the state of architectures for data infrastructure.
It covers data sources, ingestion and transformation, storage, historical analytics, predictive analytics, and outputs, along with the different tools that can be used for each, plus case studies from several large companies on their data infrastructure setups.
Highly recommended.
https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/